Abstract

In this paper we study properties of the sample covariance matrix $\Sigma_n$ as an
estimator of $p \times p$ population matrices $\Sigma$ of reduced effective rank. The effective rank
$r_e(\Sigma)$ of a matrix is the ratio of its trace to its largest singular value, and provides
a measure of matrix complexity. Despite the very large body of work on covariance
matrix estimation, the properties of $\Sigma_n$ over classes of population matrices of reduced
$r_e(\Sigma)$ are largely unexplored. Our first contribution is to review and establish sharp
finite sample bounds on the operator and Frobenius norm of $\Sigma_n - \Sigma$. These bounds
reveal that, as long as $r_e(\Sigma) < n$, the sample covariance matrix $\Sigma_n$ can still serve as an
accurate estimator of $\Sigma$, even if $p > n$. Moreover, and perhaps surprisingly, $\Sigma_n$ adapts
to the unknown complexity of $\Sigma$ quantified by $r_e(\Sigma)$, without any need for further
thresholding operations. Our main contribution is in employing these results for the
study of the consistency of the scree-plot selection procedure routinely used in PCA. For
given $p$ and $n$, we quantify the largest number of the eigenvalues of $\Sigma$ that can be
consistently estimated by thresholding the spectrum of $\Sigma_n$, and offer a data adaptive
construction of the threshold level. We derive the finite sample rates of convergence
for the selected eigenvalues. We show that the analysis of the selected eigenvectors
requires further assumptions on the spectrum of $\Sigma$, and discuss their implications on
the construction of thresholding levels. As an application, we consider aspects of
functional principal components analysis (fPCA). When the data consists of a sample of
discretely observed trajectories of a stochastic process, we study theoretically, in finite
samples, the scree plot method for selecting principal components constructed from
the sample covariance matrix. We quantify the accuracy of the selected sample
eigenvalues and eigenvectors, in finite samples. This complements the existing asymptotic
analyses in which a pre-specified and fixed number of components are analyzed and
the theoretical study of their selection is left open.



1 Introduction 

High dimensional covariance matrix estimation has received considerable attention over
the last few years. This is largely motivated by the fact that the sample covariance matrix
$\Sigma_n$, based on a sample of size $n$, is not necessarily a consistent estimator of the covariance
matrix $\Sigma$ of a random vector $X \in \mathbb{R}^p$, if $p > n$. In this regime, the shortcomings of $\Sigma_n$
have been well understood for over a decade, whenever we estimate a spiked covariance
matrix; see, for instance, the seminal works of Baik and Silverstein [2] and Johnstone [13].
By definition, spiked models have a fixed number of large eigenvalues and the rest equal
to one. Therefore, the effective number of parameters in such models is of order $p^2$, and
there is no hope to estimate them accurately from a small sample. To address this issue,
classes of sparse covariance matrices have been introduced in recent years. Depending on the
type of sparsity (entry-wise, row-wise, off-diagonal decay), appropriate estimators have been
introduced and shown to adapt to the unknown sparsity structures; see, for instance, Bickel
and Levina [4], [5], Cai et al. [8], Cai and Liu [7], among many others. It is important to note
that although sparse matrices, by definition, have a reduced number of parameters, they
can still be spiked. Therefore, the usage of the sample covariance matrix $\Sigma_n$ in this context
would still be questionable, in addition to not rendering the appropriate sparse structure. It
is also important to observe that all sparse covariance matrix models carry with them
implicit modeling assumptions. For instance, they are appropriate whenever many of the
components of $X$ are weakly correlated. They are also powerful for modeling temporally or
spatially ordered variables, in cases where it is reasonable to assume that variables that are
far apart in time or space have very little association.

However, there are many instances where these assumptions are not satisfied, for example
when the observed variables are known to have strong associations with each other. If the
association is approximately linear, $\Sigma$ will be close to being a degenerate, rank $r < p$ matrix,
with possibly far fewer parameters than $p^2$, if $r$ is small. To treat general, positive definite
covariance matrices, which have effectively reduced rank, we make use of the notion of
effective rank, first suggested by Vershynin [22] and given by

$$r_e(\Sigma) = \frac{\mathrm{tr}(\Sigma)}{\|\Sigma\|_2}. \qquad (1.1)$$

Here $\|\Sigma\|_2$ denotes the operator norm, or the largest singular value, of $\Sigma$. Clearly, $r_e(\Sigma)$ is
no larger than the rank for degenerate matrices and, in general, it can be significantly smaller
than $p$ if a large number of eigenvalues of $\Sigma$ are relatively small.
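
To make definition (1.1) concrete, the following minimal sketch computes the effective rank as the trace divided by the largest singular value, and compares a spiked spectrum with a polynomially decaying one. The function name `effective_rank` and both example spectra are illustrative assumptions, not objects taken from the paper.

```python
import numpy as np

def effective_rank(sigma):
    """Effective rank tr(Sigma) / ||Sigma||_2, following definition (1.1)."""
    sigma = np.asarray(sigma, dtype=float)
    op_norm = np.linalg.norm(sigma, 2)   # largest singular value
    return np.trace(sigma) / op_norm

p = 200

# Hypothetical spiked spectrum: two large eigenvalues, the rest equal to one.
spiked = np.diag(np.r_[10.0, 5.0, np.ones(p - 2)])

# Hypothetical decaying spectrum: lambda_k = k^(-2).
decaying = np.diag(np.arange(1, p + 1, dtype=float) ** -2.0)

r_spiked = effective_rank(spiked)        # (10 + 5 + 198) / 10, grows linearly with p
r_decaying = effective_rank(decaying)    # partial sum of k^(-2), bounded in p
```

In the spiked case the effective rank is of order $p$, whereas for the decaying spectrum it stays below $\pi^2/6$ however large $p$ is, which is the regime the paper targets.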

Perhaps surprisingly, the properties of the sample covariance matrix over classes of
population matrices of reduced effective rank are largely unstudied, with the few exceptions
we discuss below. Our first contribution is to show that, for sub-Gaussian distributions,
the sample covariance estimator $\Sigma_n$ remains an accurate estimator of population matrices $\Sigma$
with $r_e(\Sigma) < n$, up to logarithmic factors, even if $p > n$. For this we establish sharp upper
bounds on (i) the Frobenius norm $\|\Sigma_n - \Sigma\|_F$ and (ii) the operator norm $\|\Sigma_n - \Sigma\|_2$. The analysis
of $\|\Sigma_n - \Sigma\|_2$ extends to distributions with unbounded support the results, similar in spirit,
that have been recently derived by Vershynin [22, Section 5.4.3] and Lounici [18]. For the
derivation of upper bounds on $\|\Sigma_n - \Sigma\|_F$ we employ concentration inequalities developed by
Juditsky and Nemirovski for averages of matrices belonging to general $\kappa$-smooth spaces.
Our results are derived in Section 2.2 and summarized in Table 1 below. To place these
results in context, consider for instance the bounds on the Frobenius norm, in either regime for
$p$. To ease the interpretation of our results, we rewrite our bound in terms of the commonly
used scaled squared Frobenius norm:

$$\frac{1}{p}\|\Sigma_n - \Sigma\|_F^2 \lesssim \|\Sigma\|_2^2 \cdot r_e(\Sigma) \cdot \frac{r_e(\Sigma)}{p} \cdot \frac{\ln n}{n} \lesssim \|\Sigma\|_2^2 \cdot r_e(\Sigma) \cdot \frac{\ln n}{n},$$

where $\lesssim$ is used for inequalities that hold up to multiplicative constants. If $\Sigma$ is indeed
spiked, with no decay of its smaller eigenvalues, then $r_e(\Sigma) = O(p)$ and the bound cannot
be small if $p > n$, confirming all existing results. However, if the effective rank $r_e(\Sigma) =
o(n)$, up to logarithmic factors, and $\|\Sigma\|_2$ is appropriately bounded, then $\Sigma_n$ provides an
accurate estimator, even if $p > n$. The same comparison can be made with respect to the
operator norm. This makes it clear that the size of $r_e(\Sigma)$ relative to $n$, rather than that of
$p$, up to logarithmic factors, governs the accuracy of $\Sigma_n$ as an estimator of $\Sigma$. Moreover,
it is noteworthy that, in terms of rates of convergence, the simple estimator $\Sigma_n$ adapts
directly to the unknown complexity of $\Sigma$, as measured by $r_e(\Sigma)$, without any need for further
thresholding or shrinking operations.

| Norm / Values of $p$ | $p = O(n^\alpha)$, $\alpha > 0$ | $p = O(\exp(n))$ |
| Frobenius: $\|\Sigma_n - \Sigma\|_F$ | $\|\Sigma\|_2 \cdot r_e(\Sigma) \cdot \sqrt{\ln n / n}$ | $\|\Sigma\|_2 \cdot r_e(\Sigma) \cdot \sqrt{\ln n / n}$ |
| Operator: $\|\Sigma_n - \Sigma\|_2$ | $\|\Sigma\|_2 \cdot r_e(\Sigma) \cdot \ln(pn)/n$, if $r_e(\Sigma) > n/\ln(pn)$ | $\|\Sigma\|_2 \cdot r_e(\Sigma) \cdot \sqrt{\ln n / n}$ |
| | $\|\Sigma\|_2 \cdot \sqrt{r_e(\Sigma)} \cdot \sqrt{\ln(pn)/n}$, if $r_e(\Sigma) < n/\ln(pn)$ | |

Table 1: Optimal rates for the Frobenius and operator norm of $\Sigma_n - \Sigma$: orders of magnitude
depending on the regime of $p$. Within each regime, the size of $r_e(\Sigma)$ relative to $n$ dictates
the final rate.

Since the usage of $\Sigma_n$ is justified whenever $r_e(\Sigma)$ is appropriately small, as above,
principal component analysis (PCA) based on $\Sigma_n$ is also justified, even if $p > n$. In this
context, our main contribution, presented in Section 3, is to provide a finite sample analysis
of the popular scree-plot method for selecting principal components. The method consists
in first choosing a threshold level $\eta > 0$ and determining

$$\hat k(\eta) := \sum_{k} 1\{\hat\lambda_k \ge \eta\}, \qquad (1.2)$$


where the eigenvalues of $\Sigma_n$ are denoted by $\hat\lambda_k$. Then, one retains $\hat\lambda_k$ and the corresponding
eigenvectors $\hat\psi_k$ of $\Sigma_n$, with $k \le \hat k(\eta)$, for further inference. To the best of our knowledge,
there is no theoretical study of this method. We begin our analysis by answering the following
open question: what does $\hat k(\eta)$ given by (1.2) estimate and how does the estimate depend
on the choice of $\eta$? We provide a formal answer in Section 3.1. If $\lambda_k$ denotes the $k$-th
eigenvalue of $\Sigma$, we introduce the relative rank of $\Sigma$, $r(\eta)$, as the largest index $k$ for which
$\lambda_k$ is larger than $\eta$. We show, in Theorem 3.1, that $\hat k(\eta) = r(\eta)$, with high probability, as
expected. The important new element is the theoretical quantification, in finite samples, of
the minimal order of the threshold level for which this happens. We show that it is $\eta = \eta_{\min}$,
where $\eta_{\min} > \|\Sigma_n - \Sigma\|_2$, with high probability. From the bounds on $\|\Sigma_n - \Sigma\|_2$ given in
Table 1 we can therefore deduce the order of magnitude of $\eta_{\min}$, for a given configuration
of $n$ and $p$. In particular, whenever $r_e(\Sigma)$ and $\|\Sigma\|_2$ are bounded independently of $n$ and $p$,
$\eta_{\min} = O(\sqrt{\ln n / n})$. We show in Section 4 that matrices $\Sigma$ with these particular features
arise naturally in the context of functional principal component analysis, fPCA. In general,
once the form of $\eta_{\min}$ is identified, we show in Theorem 3.2 that we can construct a fully
data driven threshold level $\hat\eta_n$ such that $\hat k(\hat\eta_n) = r(\eta_{\min})$, with high probability.
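
The scree-plot rule (1.2) and the relative rank it is meant to estimate can be sketched in a few lines. The helper names `scree_select` and `relative_rank`, the toy diagonal matrix, and the choice of a non-strict comparison at the threshold are assumptions made for illustration only; the data-driven choice of the threshold itself is the subject of Section 3.

```python
import numpy as np

def scree_select(sample_cov, eta):
    """Count and return the sample eigenpairs whose eigenvalue is at least eta."""
    evals, evecs = np.linalg.eigh(sample_cov)     # eigh returns ascending order
    order = np.argsort(evals)[::-1]               # reorder to descending
    evals, evecs = evals[order], evecs[:, order]
    k_hat = int(np.sum(evals >= eta))             # thresholded count, as in (1.2)
    return k_hat, evals[:k_hat], evecs[:, :k_hat]

def relative_rank(pop_eigenvalues, eta):
    """Largest index k whose population eigenvalue exceeds eta."""
    return int(np.sum(np.asarray(pop_eigenvalues, dtype=float) > eta))

# Toy input standing in for a sample covariance matrix.
toy_cov = np.diag([4.0, 0.1, 2.0, 0.5])
k_hat, kept_values, kept_vectors = scree_select(toy_cov, eta=1.0)
r_eta = relative_rank([4.0, 2.0, 0.5, 0.1], eta=1.0)
```

Here two eigenvalues (4 and 2) lie above the threshold 1, so both counts equal 2; with real data the two counts agree only when the threshold dominates the estimation error, which is what the results below quantify.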
As an immediate consequence, we show in Section 3.3, for matrices $\Sigma$ with appropriately
bounded $\|\Sigma\|_2$ and $r_e(\Sigma)$, that $|\hat\lambda_k - \lambda_k| = O(\sqrt{\log n / n})$, for each $k \le \hat k(\hat\eta_n)$, with high
probability. Theorem 3.3 of Section 3.3 shows that, however, the corresponding sample
eigenvectors $\hat\psi_k$ are not necessarily close to the theoretical $\psi_k$, for all these values of $k$, and
that further information on the degree of separation of the spectrum of $\Sigma$ is needed for this
analysis. The problem becomes especially difficult when the eigenvalues of $\Sigma$ themselves are
small, which is typically the case in fPCA, as explained in Section 4. With a view towards this
application, we focus, in Section 3.3, on the analysis of matrices $\Sigma$ with eigenvalues that have
appropriate polynomial decay, such that $r_e(\Sigma)$ and $\|\Sigma\|_2$ are bounded independently of
$n$ and $p$. In this case we give an explicit expression of the theoretical threshold $\eta^* > \eta_{\min}$,
and construct a data driven threshold $\hat\eta^*$ such that $\hat k(\hat\eta^*) = r(\eta^*)$ and $\|\hat\psi_k - \psi_k\| = o(1)$
for each $k \le \hat k(\hat\eta^*)$, with high probability.


In Section 4 we show how our results can be employed in fPCA. We will study in detail this
problem when data consists of a sample of $n$ independent trajectories $X_i(t)$ of a background
stochastic process $X(t)$ with covariance operator $\mathcal{K}$. For perfectly observed trajectories, at all
time points $t$ and without additive noise, Dauxois et al. [9] proved that the sample eigenvalues
and eigenfunctions attain the parametric rate of convergence. In the same context, Hall and
Hosseini-Nasab [10] developed these results further to obtain bootstrap based confidence
intervals. When data consists of discretely observed trajectories, corrupted by additive
noise, a large number of smoothing techniques have been suggested; see Bunea et al. [6]
for an overview. In this setting, the theoretical properties of the estimated eigenvalues and
eigenvectors have been established by, for instance, Yao et al. [23], Hall et al. [11] and Benko
et al. [3]. The theoretical analyses presented in all these works have in common the following
element: the number of eigenvalues and eigenfunctions to be estimated and analyzed is fixed
in advance, and is independent of $n$ and of the number of sampling points per trajectory. Our
contribution is to determine, from the data, how many of the principal components can be
reliably estimated. To keep our presentation focused and show how results from covariance
matrix estimation can be brought into fPCA, we restrict our attention to estimates based on
the sample covariance matrix relative to trajectories that are all sampled at the same $m$ fixed
time points. Our analysis can be regarded as intermediate between those conducted for fully
observed trajectories and those that employ smoothing techniques. We give precise finite
sample bounds on the quality of the sample eigenvalues and eigenvectors thus obtained in
Theorem 4.2. Our results show that, for $m \to \infty$, we recover the parametric convergence rate
for all estimated eigenvalues, up to logarithmic factors. The same rate is valid for the first few
eigenvectors, confirming all existing results. However, as the number of eigenvectors to be
analyzed is allowed to depend on $m$ and $n$, this may stop being true. We present this analysis
in Theorem 4.2 of Section 4.3. Moreover, in Section 4.2 we analyze the scree-plot selection
criterion (1.2); see for instance Ramsay et al. [19] for an overview in the functional data
context. Its impact on the quality of the sample eigenvalues and eigenvectors is discussed in
Theorem 4.3 and the results mirror those of Section 3.3 and the discussion above.

The proofs of all our theoretical results are given in the Appendix. 



1.1 Notation 

We shall use the following notation throughout our paper: $\|\cdot\|_F$, the Frobenius norm; $\|\cdot\|_2$,
the spectral/operator norm; $\|\cdot\|_*$, the nuclear norm; $\|\cdot\|$, the Euclidean norm of a vector;
$\mathrm{tr}(\cdot)$, the trace of a square matrix; $I_p$, an identity matrix of dimension $p$; psd, positive
semi-definite; for a matrix $A$ denote by $d_k(A)$ the $k$th largest singular value of $A$. We will also
use the notation $\lesssim$ for inequalities that hold up to multiplicative constants independent of
$n$ and $p$ (or $m$).
2 Some inequalities for the sample covariance matrix



2.1 Sub-Gaussian distributions

All the results of this paper are proved for a certain class of sub-Gaussian distributions.
In particular they all hold for Gaussian vectors or processes. We define this class below.
To simplify notation, we assume in this section that all random variables or vectors have
zero mean. We recall that a zero-mean random variable $X \in \mathbb{R}$ is sub-Gaussian if there
exists a constant $\sigma > 0$ such that $E\exp(tX) \le \exp(t^2\sigma^2/2)$, for all $t \in \mathbb{R}$. Then it can be
shown that $\sup_{k \ge 1} k^{-1/2}(E|X|^k)^{1/k} < \infty$ and the sub-Gaussian norm of $X$ is defined to be
$\|X\|_{\psi_2} = \sup_{k \ge 1} k^{-1/2}(E|X|^k)^{1/k}$. A zero-mean random vector $X \in \mathbb{R}^p$ is sub-Gaussian if
for any non-random $u \in \mathbb{R}^p$, $u'X$ is sub-Gaussian. The sub-Gaussian norm of $X$ is defined
as $\|X\|_{\psi_2} = \sup_{u \in \mathbb{R}^p \setminus \{0\}} \|u'X\|_{\psi_2}/\|u\|$. We will impose an additional assumption on a
sub-Gaussian random vector:

Assumption 1. For a zero-mean sub-Gaussian random vector $X \in \mathbb{R}^p$, we assume that
there exists a constant $c_0 > 0$ such that $E(u'X)^2 \ge c_0\|u'X\|_{\psi_2}^2$ for all $u \in \mathbb{R}^p$.

The above assumption effectively bounds the higher moments of $X$ as polynomial functions
of the second moments of $X$. Let $\Sigma$ be the covariance matrix of $X$; then $u'\Sigma u \ge c_0\|u'X\|_{\psi_2}^2$,
for all $u \in \mathbb{R}^p$, under Assumption 1. We will provide a number of distributions of interest
that meet this assumption below. Before that we point out that if $X \in \mathbb{R}^p$ is sub-Gaussian
and satisfies Assumption 1 and $O \in \mathbb{R}^{p \times p}$ is an orthonormal matrix, then $OX$ is also
sub-Gaussian and satisfies Assumption 1 with the same $c_0$.

Example 2.1. Let $X = (X_1, \ldots, X_p)'$, where the components $X_j$ are independent and have a
zero-mean sub-Gaussian distribution. Specifically, suppose there is a common constant $\sigma > 0$
such that $E\exp(tX_j/\sqrt{\Sigma_{jj}}) \le \exp(t^2\sigma^2/2)$ for all $j$, where $\Sigma_{jj}$ is the variance of $X_j$. Then
$X$ is sub-Gaussian and satisfies Assumption 1.

Example 2.2. Let $X$ be defined as in Example 2.1. Let $\tilde X = OX$ where $O \in \mathbb{R}^{p \times p}$ is an
orthonormal matrix. Then $\tilde X$ is sub-Gaussian and satisfies Assumption 1.

A proof of the statements above is provided in Appendix A.1.

2.2 Bounds on the Frobenius norm, operator norm and trace of the sample covariance matrix as functions of $r_e(\Sigma)$

In this section we revisit properties of the sample covariance matrix. Let $X_1, \ldots, X_n$ be
i.i.d. observations of a random vector $X \in \mathbb{R}^p$. Without loss of generality, we assume that
$E(X) =: \mu = 0$; otherwise all our results carry over unchanged with $X_i$ replaced by $X_i - \mu$.
Let $\bar X = n^{-1}\sum_{i=1}^n X_i$ and $\Sigma_n = n^{-1}\sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)'$ be the sample covariance
matrix. We establish below sharp probability upper bounds on $\Sigma_n - \Sigma$, in terms of both
the Frobenius and the operator norms. As announced in the Introduction, we show that the
effective rank, $r_e(\Sigma) = \mathrm{tr}(\Sigma)/\|\Sigma\|_2$, governs the rates of convergence. For our analysis, we write
$\Sigma_n = \Sigma_n^* - \bar X\bar X'$, where $\Sigma_n^* = n^{-1}\sum_{i=1}^n X_iX_i'$. We shall study $\Sigma_n^* - \Sigma$ and $\bar X\bar X'$ separately.
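
The decomposition of the sample covariance matrix into an uncentered second-moment matrix minus the rank-one matrix built from the sample mean, together with the norm identity for rank-one matrices used for the mean term, can be checked numerically. The sketch below uses simulated Gaussian data with arbitrary sizes; the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p))                  # rows play the role of X_1, ..., X_n

x_bar = X.mean(axis=0)                           # sample mean vector
sigma_n = (X - x_bar).T @ (X - x_bar) / n        # centered sample covariance, 1/n scaling
sigma_star = X.T @ X / n                         # uncentered second-moment matrix
rank_one = np.outer(x_bar, x_bar)                # rank-one matrix from the sample mean

# Decomposition: centered covariance = uncentered matrix minus the rank-one term.
decomposition_ok = np.allclose(sigma_n, sigma_star - rank_one)

# For a rank-one matrix x x', Frobenius and operator norms both equal ||x||^2.
fro_norm = np.linalg.norm(rank_one, "fro")
op_norm = np.linalg.norm(rank_one, 2)
squared_length = float(x_bar @ x_bar)
```

The identity is exact only with the $1/n$ normalization; with the $1/(n-1)$ convention the two sides differ by a factor of $n/(n-1)$.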

We begin with the study of $\bar X\bar X'$. Since this is a rank 1 matrix, we make use of the basic
fact $\|\bar X\bar X'\|_F = \|\bar X\bar X'\|_2 = \|\bar X\|^2$. The following proposition is instrumental in the proof of
Theorems 2.1 and 2.2, and may be of interest in its own right.

Proposition 2.1. Let Assumption 1 hold. There exist two fixed positive constants $C$, $c$ such
that, if $|t| > c(4c_0^{-1} + 1)\,\mathrm{tr}(\Sigma)$,




Theorem 2.1. Let Assumption 1 hold. With probability at least $1 - 3n^{-1}$,

$$\|\bar X\bar X'\|_2 \le c_1 \cdot \|\Sigma\|_2 \cdot r_e(\Sigma) \cdot \frac{\ln n}{n}.$$
Remark 2.1. All proofs for this section are provided in Appendix A.1. As the following
results show, the upper bound in Theorem 2.1 is of order no bigger than that of $\|\Sigma_n^* - \Sigma\|_F$
and $\|\Sigma_n^* - \Sigma\|_2$, respectively.

Next we study $\Sigma_n^* - \Sigma$. Let $Z_i = X_iX_i' - \Sigma$. Then $E(Z_i) = 0$ and $\Sigma_n^* - \Sigma = n^{-1}\sum_{i=1}^n Z_i$. We
begin by stating the bounds with respect to the Frobenius norm.

Theorem 2.2. Let Assumption 1 hold. The following inequality holds with probability at
least $1 - n^{-1}$,

$$\|\Sigma_n^* - \Sigma\|_F \le c_2 \cdot \|\Sigma\|_2 \cdot r_e(\Sigma) \cdot \left\{ \frac{\sqrt{2\exp(1)} + 8\sqrt{\ln 2n}}{\sqrt{n}} + \frac{\ln n}{n} \right\},$$


where c 2 = 2 max ^v^C*, (4c 1 + 1) . 

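Combining Theorem 2.1 and Theorem 2.2 we obtain the upper bound for $\|\Sigma_n - \Sigma\|_F$.
Theorem 2.3. Let Assumption 1 hold. With probability at least $1 - 4n^{-1}$,

$$\|\Sigma_n - \Sigma\|_F \le c_2 \cdot \|\Sigma\|_2 \cdot r_e(\Sigma) \cdot \left\{ \frac{\sqrt{2\exp(1)} + 8\sqrt{\ln 2n}}{\sqrt{n}} + \frac{\ln n}{n} \right\},$$

where $c_2$ is defined as in Theorem 2.2.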

The following results assess the accuracy of $\Sigma_n$ with respect to the operator norm.
Theorem 2.4. Let Assumption 1 hold. With probability at least $1 - n^{-1}$,

$$\|\Sigma_n^* - \Sigma\|_2 \le c_3 \cdot \|\Sigma\|_2 \cdot \max\left\{ \sqrt{\frac{r_e(\Sigma) \cdot \ln(pn)}{2n}},\ \frac{r_e(\Sigma) \cdot \ln(pn)}{n} \right\},$$

where $c_3$ is a fixed constant that depends only on $c_0$.

Combining this result with Theorem 2.1 above we obtain:
Theorem 2.5. Let Assumption 1 hold. With probability at least $1 - 4n^{-1}$,

$$\|\Sigma_n - \Sigma\|_2 \le (c_1 + c_3) \cdot \|\Sigma\|_2 \cdot \max\left\{ \sqrt{\frac{r_e(\Sigma) \cdot \ln(pn)}{2n}},\ \frac{r_e(\Sigma) \cdot \ln(pn)}{n} \right\},$$

where $c_1$ is defined as in Theorem 2.1 and $c_3$ is defined as in Theorem 2.4.




Next we provide a theorem that shows that the trace of the sample covariance matrix is 
concentrated around the trace of the population covariance matrix, with high probability. 
This result is instrumental in deriving data dependent thresholds in Section 3.3 below. 

Theorem 2.6. Let $c_1$ and $c_2$ be defined as in Theorems 2.1 and 2.3. Then, with probability at
least 1 — Qn~ x , 

$$|\mathrm{tr}(\Sigma_n) - \mathrm{tr}(\Sigma)| \le \left( c_1\frac{\ln n}{n} + c_2\sqrt{\frac{\ln n}{n}} \right) \cdot \mathrm{tr}(\Sigma).$$

Remark 2.2. (i) As can be seen from the proofs in Appendix A.1, all our results continue
to hold if $\Sigma$ is singular.

(ii) Theorem 2.5 makes it clear that if $p > n$ and also $r_e(\Sigma) > n/\ln(pn)$, the bound on
$\|\Sigma_n - \Sigma\|_2$ cannot be close to zero. However, even if $p > n$ is large, but $r_e(\Sigma) < n/\ln(pn)$, we
revert back to a fast rate. Specifically, and up to multiplicative constants,

$$\|\Sigma_n - \Sigma\|_2 \lesssim \|\Sigma\|_2 \sqrt{r_e(\Sigma)} \sqrt{\frac{\ln(pn)}{n}}. \qquad (2.1)$$

Therefore, as long as $\|\Sigma\|_2\sqrt{r_e(\Sigma)} = o\left(\sqrt{n/\ln(pn)}\right)$, the sample covariance matrix $\Sigma_n$ will
be close to $\Sigma$ in operator norm, with high probability.

(iii) The explicit dependency on $p$ under the logarithmic term in (2.1) above makes this
bound unusable if $p$ grows exponentially fast with $n$ or if $p \to \infty$ independently of $n$. The
latter situation is of particular interest for the analysis of functional data, as we describe in
detail in Section 4. Fortunately, the simple inequality $\|M\|_2 \le \|M\|_F$, for any matrix $M$,
coupled with Theorem 2.3 also yields, with probability larger than $1 - 4n^{-1}$,

$$\|\Sigma_n - \Sigma\|_2 \lesssim \|\Sigma\|_2 \cdot r_e(\Sigma) \cdot \sqrt{\frac{\ln n}{n}}, \qquad (2.2)$$

which is the fastest possible rate in these situations, as the explicit dependency on $p$ is
removed.

(iv) The rates given by Theorems 2.5 and 2.3 above are minimax optimal over the class
of matrices with effective rank bounded by $\min(\sqrt{n}, p)$, up to logarithmic terms. We refer
to Theorem 2 of Lounici [18] for the lower bound derivations with respect to the operator
norm. The lower bound with respect to the squared Frobenius norm derived in Theorem 2
of Lounici [18] is of the order of $\|\Sigma\|_2^2 \cdot r_e(\Sigma) \cdot p/n$, which is larger than the rate we derived in
Theorem 2.3. However, the proof of Theorem 2 in Lounici [18] can be tightened, by keeping
only the first line of his inequality (5.13), to show that the minimax lower bound is in fact
$\|\Sigma\|_2^2 \cdot r_e^2(\Sigma)/n$. Therefore, our rate is near minimax optimal, over the class of matrices with
effective rank bounded by $\min(\sqrt{n}, p)$. The study of minimax optimality for larger effective
ranks is deferred to future work.



3 Selecting the number of principal components in finite samples

In the previous section we showed that, if $r_e(\Sigma)$ is appropriately small relative to $n$, the
sample covariance matrix $\Sigma_n$ provides a viable and rate optimal estimator of $\Sigma$, even if $p > n$.
In many applications, notably in any that involve PCA, the properties of $\Sigma_n$ itself are not
the main focus, and one studies instead features of it. See Jolliffe [14] for a comprehensive
study of PCA. To the best of our knowledge, no finite sample analysis of the properties of
the selected number of components and of the corresponding eigenvalues and eigenvectors
has been conducted. The existing results by Anderson [1] and Srivastava and Khatri [20]
are of asymptotic nature, and none makes explicit use of the crucial role played by $r_e(\Sigma)$
in this problem. Moreover, the popular selection method given by (1.2) described in the
Introduction has not been studied theoretically. This motivates our study of this problem.
We first introduce the notation employed throughout this section. Let $\lambda_k \ge 0$ and $\psi_k$,
$1 \le k \le p$, be, respectively, the eigenvalues and eigenvectors of $\Sigma$. Similarly, $\hat\lambda_k$ and $\hat\psi_k$
are the eigenvalues and eigenvectors of the sample covariance matrix $\Sigma_n$. The sign of $\hat\psi_k$ is
selected so that $\hat\psi_k'\psi_k \ge 0$.
3.1 Detectable relative ranks of a population covariance matrix

Let $\mu > 0$ be a given threshold value. The number of top eigenvalues of $\Sigma_n$ relative to $\mu$ is

$$\hat k(\mu) =: \max\{k : \hat\lambda_k \ge \mu\}. \qquad (3.1)$$

We expect this quantity to estimate the index of an eigenvalue of $\Sigma$ where a drop in value
occurs. To formalize this, we introduce a quantity that is key to our analysis, the relative
rank of a matrix.

Definition 1 (Relative rank). Let $\eta > 0$ be given. We call an index $s$ the relative rank of $\Sigma$
at $\eta$ if

$$\lambda_s \ge (1 + \delta)\eta, \quad \text{and} \quad \lambda_{s+1} \le (1 - \delta)\eta, \qquad (3.2)$$

for some $\delta \in (0, 1]$. We say that $\Sigma$ has a spectral jump of size $2\delta\eta$ at $\eta$. We denote an index
$s$ satisfying (3.2) by $r(\eta)$.

Definition 2 (Detectable relative rank). For given $\eta > 0$, we say that the relative rank $r(\eta)$
is detectable via $\Sigma_n$ if there exists a threshold value $\mu > 0$ such that

$$\hat k(\mu) = r(\eta), \qquad (3.3)$$

with high probability.

Theorem 3.1 shows what determines the minimum value of $\eta$ or, equivalently, the maximum
value of the relative rank $r(\eta)$ that is detectable via $\Sigma_n$, together with a choice for $\mu$.

Theorem 3.1. Let $\eta > 0$ be given and assume that (3.2) holds. Set $\mu = \eta$ and let $\hat k(\eta)$ be
given by (3.1). Then

$$P\{\hat k(\eta) = r(\eta)\} \ge 1 - P(\|\Sigma_n - \Sigma\|_2 \ge \delta\eta).$$
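
The mechanism behind Theorem 3.1 can be sketched with Weyl's inequality, which bounds every eigenvalue error by the operator norm of the matrix perturbation. The short derivation below is only an illustration under the definitions of this section and does not replace the proof given in the Appendix; the shorthand $s = r(\eta)$ and $\Delta = \|\Sigma_n - \Sigma\|_2$ is introduced here for convenience.

```latex
% On the event {\Delta < \delta\eta}, Weyl's inequality |\hat\lambda_k - \lambda_k| \le \Delta
% and the spectral jump condition (3.2) give
\begin{align*}
\hat\lambda_s     &\ge \lambda_s - \Delta     > \lambda_s - \delta\eta     \ge \eta, \\
\hat\lambda_{s+1} &\le \lambda_{s+1} + \Delta < \lambda_{s+1} + \delta\eta \le \eta .
\end{align*}
% Hence exactly s sample eigenvalues lie above \eta, so \hat k(\eta) = s = r(\eta), and
\[
\{\hat k(\eta) \ne r(\eta)\} \subseteq \{\Delta \ge \delta\eta\}.
\]
```

Taking probabilities on both sides of the inclusion gives the bound stated in the theorem.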

Theorem 3.1 makes it clear that the order of magnitude of $\eta =: \eta_{\min}$ must be above the
order of magnitude of $\|\Sigma_n - \Sigma\|_2$, which we have determined in Section 2.2. We employ the
optimal bounds on $\|\Sigma_n - \Sigma\|_2$ derived in Theorem 2.5 to determine $\eta_{\min}$ if $p$ grows at most
polynomially with $n$, and result (2.2), if $p$ grows exponentially with $n$. Specifically, let

$$\eta_1 =: (c_1 + c_3) \cdot \|\Sigma\|_2 \cdot \max\left\{ \sqrt{\frac{r_e(\Sigma) \cdot \ln(pn)}{2n}},\ \frac{r_e(\Sigma) \cdot \ln(pn)}{n} \right\}, \qquad (3.4)$$

and

$$\eta_2 =: c_2 \cdot \|\Sigma\|_2 \cdot r_e(\Sigma) \cdot \left\{ \frac{\sqrt{2\exp(1)} + 8\sqrt{\ln 2n}}{\sqrt{n}} + \frac{\ln n}{n} \right\}, \qquad (3.5)$$

where $c_1$, $c_2$ and $c_3$ are defined in the theorems of Section 2.2. Corollary 3.1 below shows
that by retaining only the eigenvalues of $\Sigma_n$ that are above these values of $\eta$ we estimate
consistently $r(\eta)$.

Corollary 3.1. (1) Suppose that $p = O(n^\gamma)$, for some $\gamma > 0$, and that $X$ satisfies
Assumption 1. If (3.2) holds with $\eta \ge \eta_1$ and $\delta = \eta^{-1}\eta_1$,

$$P\{\hat k(\eta_1) = r(\eta_1)\} \ge 1 - 4n^{-1}.$$

(2) Suppose that $p = O\{\exp(n)\}$ and that $X$ satisfies Assumption 1. If (3.2) holds with
$\eta \ge \eta_2$ and $\delta = \eta^{-1}\eta_2$,

$$P\{\hat k(\eta_2) = r(\eta_2)\} \ge 1 - 4n^{-1}.$$



3.2 Data driven thresholds 

In this section we show that $r(\eta_{\min})$, the rank of $\Sigma$ relative to $\eta_{\min}$ given by either (3.4)
or (3.5) above, can be estimated consistently by $\hat k(\mu)$ defined by (3.1), when $\mu$ is replaced,
respectively and up to universal constants, by the data dependent sequences below:

$$2\hat\eta_{1,n} = 2(c_1 + c_3) \cdot \max\left\{ \sqrt{\frac{\|\Sigma_n\|_2 \cdot \mathrm{tr}(\Sigma_n) \cdot \ln(pn)}{2n(1 + \varepsilon_1)(1 + \varepsilon_2)}},\ \frac{\mathrm{tr}(\Sigma_n) \cdot \ln(pn)}{n(1 + \varepsilon_1)} \right\}, \qquad (3.6)$$

and

$$2\hat\eta_{2,n} = 2c_2 \cdot \frac{\mathrm{tr}(\Sigma_n)}{1 + \varepsilon_1} \cdot \left\{ \frac{\sqrt{2\exp(1)} + 8\sqrt{\ln 2n}}{\sqrt{n}} + \frac{\ln n}{n} \right\}, \qquad (3.7)$$

where $\varepsilon_1 = c_1\ln n/n + c_2\sqrt{\ln n/n}$, and $\varepsilon_2$ is given below.
Theorem 3.2. (1) Suppose $p = O(n^\alpha)$, for some $\alpha > 0$. Let Assumption 1 hold and assume
that there exists an $\varepsilon_2$ such that

(3.8)

Let $\hat\eta_{1,n}$ be given by (3.6). If (3.2) holds with $\eta = 2\eta_1$ with $\delta = \eta^{-1}\eta_1 + \varepsilon_1$,



(2) Suppose $p = O\{\exp(n)\}$ and let Assumption 1 hold. Let $\hat\eta_{2,n}$ be given by (3.7). If (3.2)
holds with $\eta = 2\eta_2$ with $\delta = \eta^{-1}\eta_2 + \varepsilon_1$, then



Remark 3.1. The term $\varepsilon_2$ for the case with $p = O(n^\alpha)$ is $o(1)$ as long as $r_e(\Sigma) = o\{n/\ln(n)\}$.

Remark 3.2. Theorem 3.2 shows that the size of detectable jumps decreases as $n$ increases.
Thus, as expected, we can detect finer jumps from a larger amount of data.

3.3 Sample eigenvalues and eigenvectors accuracy for reduced effective rank population covariance matrices

Consistent selection of the relative rank is of interest in its own right, but in many instances
one is also interested in the accuracy of the selected eigenvalues and eigenvectors. By Weyl's
theorem (Horn and Johnson, page 181) we have $|\hat\lambda_k - \lambda_k| \le \|\Sigma_n - \Sigma\|_2$, for each $k$.
Therefore, the results of Section 2.2 immediately yield finite sample bounds on the accuracy
of all estimated eigenvalues.

Corollary 3.2. Let $\eta_{\min}$ be either $\eta_1$ or $\eta_2$ given in (3.4) and (3.5) above. Let $C > 0$ denote
a dominating constant. Then, with probability larger than $1 - C/n$,

$$\max_{1 \le k \le p} |\hat\lambda_k - \lambda_k| \le \eta_{\min}. \qquad (3.9)$$
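
Weyl's inequality, which underlies the eigenvalue bound above, holds deterministically for any pair of symmetric matrices and can be checked on simulated data. The sketch below assumes a hypothetical diagonal population matrix with eigenvalues $k^{-2}$ and Gaussian observations; the sizes and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 30, 200

# Hypothetical population spectrum, already sorted in decreasing order.
pop_eigs = np.arange(1, p + 1, dtype=float) ** -2.0
sigma = np.diag(pop_eigs)

# Rows are independent N(0, sigma) draws (diagonal covariance, so scale columns).
X = rng.standard_normal((n, p)) * np.sqrt(pop_eigs)
sigma_n = np.cov(X, rowvar=False, bias=True)     # 1/n normalization

# Sample eigenvalues in decreasing order, matching the ordering of pop_eigs.
sample_eigs = np.sort(np.linalg.eigvalsh(sigma_n))[::-1]

op_error = np.linalg.norm(sigma_n - sigma, 2)            # operator norm of the error
max_eig_gap = np.max(np.abs(sample_eigs - pop_eigs))     # largest eigenvalue error
```

Whatever the random draw, `max_eig_gap` cannot exceed `op_error`, so any high-probability bound on the operator norm error transfers to all eigenvalues at once.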
Therefore, whenever $\eta_{\min} = o(1)$ we have convergence of all the sample eigenvalues to the
theoretical ones, at rate $\eta_{\min}$, and the choice of the threshold level in the scree plot method
does not influence the convergence rate. We show below that this is no longer true when
one studies the sample eigenvectors. We do so by constructing a threshold level $\tilde\eta$ such that
$\|\hat\psi_k - \psi_k\| = o_p(1)$, for all $k \le r(\tilde\eta)$. We show below that $\tilde\eta$ is not necessarily equal to $\eta_{\min}$,
and it is typically larger. An application of Lemma A.1 in Kneip and Utikal [17] combined
with the results of Section 2.2 immediately yields the following theorem.

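Theorem 3.3. Let Assumption 1 hold. Let $\eta_{\min}$ be given by either (3.4) or (3.5). Let
$EG(\Sigma) := \{\lambda_1, \ldots, \lambda_p\}$, and assume that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. Then, with probability
at least $1 - 4n^{-1}$, we have

$$\|\hat\psi_k - \psi_k\| \le \frac{\eta_{\min}}{\min_{\lambda \in EG(\Sigma),\, \lambda \ne \lambda_k} |\lambda - \lambda_k|} + \frac{6\,\eta_{\min}^2}{\min_{\lambda \in EG(\Sigma),\, \lambda \ne \lambda_k} |\lambda - \lambda_k|^2}, \qquad (3.10)$$

for each $k = 1, \ldots, n \wedge p$.
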



The term $\|\hat\psi_k - \psi_k\|$ is ideally equal to 0 if the sample and theoretical eigenvectors
coincide, and (3.10) quantifies precisely how close they are. Thus, whenever $\|\Sigma_n - \Sigma\|_2$
is small and, moreover, whenever the distance between the successive eigenvalues is larger
than $\|\Sigma_n - \Sigma\|_2$, $\hat\psi_k$ and $\psi_k$ are close. Therefore, an analysis of the number of accurate
sample eigenvectors depends on assumptions on the spectrum of $\Sigma$. The problem becomes
particularly challenging when the eigenvalues themselves are small, and so their differences
are necessarily small. With a view towards the application to functional data presented in
Section 4 below, we consider population matrices of finite effective rank, and with bounded
largest eigenvalue $\lambda_1 = \|\Sigma\|_2$. Then, assuming that $r_e(\Sigma) = \mathrm{tr}(\Sigma)/\lambda_1 < \infty$, for possibly
diverging $p$, implies that the eigenvalues $\lambda_j$ converge to zero with $p$. We further assume a
certain type of decay for $\lambda_j$, that appears naturally in Section 4 and also allows for a clear
illustration of the usage of Theorem 3.3 above in this context.

Assumption 2. There exist two absolute constants $c_\lambda$ and $C_\lambda$, and a constant $\beta > 1$ such
that for all $k$: $c_\lambda k^{-\beta} \le \lambda_k \le C_\lambda k^{-\beta}$.

When Assumption 2 holds, one can typically deduce the behavior of the difference between
successive eigenvalues, which we however prefer to state below as a separate assumption, for
clarity.

Assumption 3. There exist two absolute constants $c_{2\lambda}$ and $C_{2\lambda}$ and a constant $\beta_1 > \beta$ such
that for all $k$: $c_{2\lambda}k^{-\beta_1} \le \min_{\lambda \in EG(\Sigma),\, \lambda \ne \lambda_k} |\lambda - \lambda_k| \le C_{2\lambda}k^{-\beta_1}$.

Notice that under Assumption 2, $\eta_{\min} = O(\sqrt{\ln n/n})$, irrespective of the regime of $p$ relative
to $n$.
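
The reason the order of the minimal threshold does not depend on $p$ under Assumption 2 can be made explicit with a short calculation. The sketch below writes the two constants of Assumption 2 as $c_\lambda$ and $C_\lambda$ and uses only the polynomial decay with exponent $\beta > 1$; it is a worked illustration, not part of the original argument.

```latex
\begin{align*}
\mathrm{tr}(\Sigma) = \sum_{k=1}^{p} \lambda_k
  &\le C_\lambda \sum_{k=1}^{p} k^{-\beta}
   \le C_\lambda \Bigl( 1 + \int_{1}^{\infty} x^{-\beta}\,dx \Bigr)
   = C_\lambda\,\frac{\beta}{\beta - 1}, \\
c_\lambda \le \lambda_1 = \|\Sigma\|_2 &\le C_\lambda,
\qquad\text{so}\qquad
r_e(\Sigma) = \frac{\mathrm{tr}(\Sigma)}{\|\Sigma\|_2}
  \le \frac{C_\lambda}{c_\lambda} \cdot \frac{\beta}{\beta - 1}.
\end{align*}
```

Both $\|\Sigma\|_2$ and $r_e(\Sigma)$ are therefore bounded by constants free of $n$ and $p$, and a bound of the form (2.2) then reduces to a constant multiple of $\sqrt{\ln n/n}$.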

Theorem 3.4. Under Assumptions 1-3, and with $\eta_{\min} = O(\sqrt{\ln n/n})$, let

$$\eta_\psi = O\{(\eta_{\min}\ln n)^{\beta/\beta_1}\}. \qquad (3.11)$$

Then $\|\hat\psi_k - \psi_k\| = o(1)$, for each $k \le \hat k(\eta_\psi)$, with probability higher than $1 - C/n$, for some
$C > 0$.

Remark 3.3. The proof of this theorem is immediate. We sketch it here for completeness.
By Corollary 3.1 we have $\hat K_1 =: \hat k(\eta) = r(\eta) =: K_1$, with high probability, for any $\eta \ge \eta_{\min}$.
Under Assumption 2 above and by Definition 1 of Section 3.1 above, $\lambda_{K_1} \ge \eta(1 + \delta)$ and so
$K_1 \lesssim \eta^{-1/\beta}$.
Then, by display (3.10) of Theorem 3.3, combined with Assumption 3, we have, with high
probability,

$$\|\hat\psi_k - \psi_k\| \lesssim K_1^{\beta_1}\eta_{\min} \lesssim \frac{\eta_{\min}}{\eta^{\beta_1/\beta}}, \quad \text{for each } k \le K_1,$$

and the choice of $\eta_\psi$ in (3.11) guarantees convergence to zero. A faster convergence rate can be
obtained by replacing the $\ln n$ factor in the definition of $\eta_\psi$ by a larger sequence, with the
implication that fewer eigenvectors will be selected.

Remark 3.4. Notice that $\eta_\psi \ge \eta_{\min}$, for $\eta_\psi$ given in (3.11); therefore $r(\eta_\psi) \le r(\eta_{\min})$ and,
with high probability, also $\hat k(\eta_\psi) \le \hat k(\eta_{\min})$. As discussed above, $r(\eta_{\min})$ is the maximum
relative rank of $\Sigma$ that is detectable via $\Sigma_n$. Display (3.9) shows that, whenever $\eta_{\min}$ is
appropriately small, all the sample eigenvalues, with indices $k \le \hat k(\eta_{\min})$, estimate well the
population eigenvalues, whereas Theorem 3.4 shows that fewer eigenvectors $\hat\psi_k$, corresponding
to indices $1 \le k \le \hat k(\eta_\psi)$, are accurate estimators of the population eigenvectors $\psi_k$.
These results apply to any population matrix of bounded effective rank and with decay of
its spectrum as in Assumption 2. In Section 4.2 below we show that the finite dimensional
distributions of many stochastic processes of interest have finite dimensional covariance
matrices with these properties and we use the results of this section to discuss the selection of
the number of principal components in functional data.

Remark 3.5. An immediate consequence of Theorem 3.2 is that Theorem 3.4 continues to
hold with $\eta_{\min}$ replaced by either $\hat\eta_{1,n}$ or $\hat\eta_{2,n}$, depending on the regime for $p$ relative to $n$.

4 An application to fPCA 

In this section we specialize our results to the analysis of sample covariance matrices
constructed from functional data. For this, let $X_i(s)$, $i = 1, \ldots, n$, denote an i.i.d. sample
of trajectories from a Gaussian process $\{X(t) : 0 \le t \le 1\}$, with covariance function
$\mathcal{K}(s,t) = \mathrm{cov}\{X(s), X(t)\}$. We assume that we observe discretized versions of these
trajectories, possibly corrupted by noise,
\[ Y_i(t_j) = \mu(t_j) + X_i(t_j) + \epsilon_{ij}, \tag{4.1} \]
where $\mu(\cdot)$ is the mean function and $\epsilon_{ij}$ are mean zero measurement errors that are
independent of $X_i(\cdot)$. We assume $\mathrm{var}(\epsilon_{ij}) = \sigma^2$ is finite. As mentioned before we assume that all
trajectories are observed at the same set of $m$ points $\{0 < t_1 < t_2 < \cdots < t_{m-1} < t_m < 1\}$ in
$[0,1]$. We denote by $\pi_m$ the projection mapping $X(t)$ into an $m$-dimensional space $\mathbb{R}^m$,
defined by $\pi_m(X) = (X(t_1), \ldots, X(t_m))$. We refer to the distributions on $\mathbb{R}^m$ induced
by $\pi_m$ as the finite-dimensional distributions of $X$. Let $K = m^{-1}\{\mathcal{K}(t_{j_1}, t_{j_2})\}_{1 \le j_1, j_2 \le m}$ be
the scaled covariance matrix corresponding to the $m$-dimensional distribution of $X$. Let
$Y_i = \{Y_i(t_1), \ldots, Y_i(t_m)\}'$, $\bar Y(t) = n^{-1}\sum_{i=1}^n Y_i(t)$ and $\bar Y = \{\bar Y(t_1), \ldots, \bar Y(t_m)\}'$. To facilitate
comparison with the results of the previous section we denote
\[ \Sigma = K + m^{-1}\sigma^2 I_m. \]
An estimate of $\Sigma$ is the sample covariance matrix
\[ \Sigma_n = \frac{1}{nm}\sum_{i=1}^n (Y_i - \bar Y)(Y_i - \bar Y)'. \]

To keep our presentation focused, and to facilitate the immediate application of the results
of the previous section to this case, we have employed the sample mean $\bar Y$ as an estimator
of the mean function of the process. For the scenario we study below, of densely sampled
trajectories, the results that follow show that $\bar Y$ suffices. For more complicated sampling
schemes, one would need to use a smooth estimator. Our results will carry over to this
situation, but one would need to establish an appropriate equivalent of Theorem 2.1, which
we defer to future work.
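
As an illustration of this setup, the sketch below simulates noisy, discretely observed Brownian motion trajectories and forms a scaled sample covariance matrix. It is only a sketch under stated assumptions: the equispaced grid $t_j = j/(m+1)$, the zero mean function, the Gaussian noise, the function names, and the $1/(nm)$ scaling (chosen to match $\Sigma = K + m^{-1}\sigma^2 I_m$) are illustrative choices, not taken from the text.

```python
import numpy as np


def simulate_noisy_brownian(n, m, sigma, rng):
    # Equispaced design points in (0, 1): an assumption for illustration.
    t = np.arange(1, m + 1) / (m + 1)
    steps = np.diff(np.concatenate(([0.0], t)))
    # Brownian motion as a cumulative sum of independent Gaussian increments.
    X = np.cumsum(rng.normal(scale=np.sqrt(steps), size=(n, m)), axis=1)
    # Zero mean function plus independent measurement error of variance sigma^2.
    Y = X + rng.normal(scale=sigma, size=(n, m))
    return t, Y


def scaled_sample_covariance(Y):
    # (n m)^{-1} * sum_i (Y_i - Ybar)(Y_i - Ybar)', with Ybar the sample mean.
    n, m = Y.shape
    centered = Y - Y.mean(axis=0, keepdims=True)
    return centered.T @ centered / (n * m)


rng = np.random.default_rng(0)
t, Y = simulate_noisy_brownian(n=400, m=50, sigma=0.1, rng=rng)
Sigma_n = scaled_sample_covariance(Y)
trace_n = np.trace(Sigma_n)                          # target: 0.5 + sigma^2
eff_rank_n = trace_n / np.linalg.norm(Sigma_n, 2)    # trace over largest eigenvalue
```

With this scaling the trace of the population matrix equals $m^{-1}\sum_j t_j + \sigma^2$, which is $0.51$ for the grid and noise level used here, so the sample trace should land near that value.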

We shall discuss in detail the quality of (3.1) given above as an estimator of the number of
relevant eigenvalues and eigenfunctions. We make the following general assumption.

Assumption A. $\mathcal{K}(s,t)$ is continuous and a positive semi-definite kernel.

Under Assumption A, Mercer's theorem guarantees that $\mathcal{K}(s,t)$ admits the representation
$\sum_{k=1}^{\infty} \lambda_k \psi_k(s)\psi_k(t)$, where $\{\lambda_1 \ge \lambda_2 \ge \cdots \ge 0\}$ are non-increasing eigenvalues and
$\{\psi_k(\cdot), k = 1, 2, \ldots\}$ are eigenfunctions that are orthonormal in $L^2[0,1]$. Moreover,
$\sum_{k} \lambda_k =: \Lambda_0 < \infty$.

In order to study how $\Sigma_n$ relates to $\mathcal{K}$ and apply the results of Sections 2 and 3 above, we need
to establish the relevant connections between $\mathcal{K}$ and $K$. This is done in Proposition 4.1 of
the next section, which allows us to show that $r_e(\Sigma) = O(1)$. Therefore, we can specialize
the results of Sections 2.2 and 3.3 to fPCA, and we present the analysis in Section 4.3.
4.1 Finite approximations of eigenvectors and eigenvalues

In this section we provide a deterministic analysis of the quality of $K$ as an approximation
to $\mathcal{K}$. We begin by stating the assumptions under which our results hold.

Assumption B. $\sup_k \sup_{s \in [0,1]} |\psi_k(s)|$ is bounded by a constant $C_0$.

Assumption C. $\partial \mathcal{K}(t,t)/\partial t$ is continuous and $\int_0^1 |\partial \mathcal{K}(t,t)/\partial t|\, dt$ is bounded.

Assumption D. $\sup_{s \in [0,1]} |\psi_k'(s)| \le C_1 k^{\gamma_1}$ for all $k$, where $\psi_k'(s)$ is the first derivative of $\psi_k$
and $C_1, \gamma_1$ are positive constants.

Assumptions B - D bound the fluctuation of the eigenfunctions and are needed to show that
the eigenvectors and eigenvalues of $K$ are good approximations for those of $\mathcal{K}(s,t)$. Note that the
trigonometric basis satisfies Assumptions B - D.

Assumption E. For all $k$ we have: $\lambda_k \le C_2 k^{-\gamma}$, for some constant $C_2$ and $\gamma > \max(\gamma_1, 1)$.

Remark 4.1. For the Brownian motion all these assumptions hold, with $\gamma = 2$ and $\gamma_1 = 1$.

With slight abuse of notation, let $\psi_k = (\psi_k(t_1), \ldots, \psi_k(t_m))'$. We also denote the eigenvalues
of $K$ by $\{\tilde\lambda_1, \tilde\lambda_2, \ldots\}$ and the associated eigenvectors by $\{\tilde\psi_1, \tilde\psi_2, \ldots\}$. We denote by $EG(\mathcal{K})$
the spectrum of $\mathcal{K}$.

Proposition 4.1. If Assumptions A - E hold and if $m$ is sufficiently large, such that
$m^{(1-\gamma)/(\gamma+\gamma_1)} \le 1/(12 c_0)$, for $c_0$ given in (A.10), then we have
\[ \sup_{k \ge 1} |\tilde\lambda_k - \lambda_k| \le C_3\, m^{\frac{1-\gamma}{\gamma+\gamma_1}}, \tag{4.2} \]
where $C_3 = C_0^2 C_2/(\gamma - 1) + C_2 + 13 c_0 \Lambda_0$, and also
\[ |\mathrm{tr}(K) - \Lambda_0| \le C_4 m^{-1}, \tag{4.3} \]
for some fixed positive constant $C_4$, independent of $m$. Moreover, we have
\[ \|\tilde\psi_k - m^{-1/2}\psi_k\| \le \frac{C_3 m^{(1-\gamma)/(\gamma+\gamma_1)}}{\min_{\lambda \in EG(\mathcal{K}),\, \lambda \ne \lambda_k} |\lambda - \lambda_k|} + 6\left\{\frac{C_3 m^{(1-\gamma)/(\gamma+\gamma_1)}}{\min_{\lambda \in EG(\mathcal{K}),\, \lambda \ne \lambda_k} |\lambda - \lambda_k|}\right\}^2 + 7 c_0\, m^{(1-\gamma)/(\gamma+\gamma_1)}, \tag{4.4} \]
for all $k \le m^{1/(\gamma+\gamma_1)}$.

To the best of our knowledge the result in Proposition 4.1 is new. Whereas the global
result (4.3) is an immediate consequence of approximating integrals by finite sums, the
derivation of the bounds on the difference between individual eigenvalues and eigenvectors is much
more involved, and depends crucially on the behavior of the spectrum and eigenfunctions of
the covariance operator $\mathcal{K}$. The combination of (4.2) and (4.3) immediately yields the result
below.

Corollary 4.1. Under the assumptions of Proposition 4.1, $r_e(K) = O(1)$ and $r_e(\Sigma) = O(1)$.

This result shows that the finite dimensional distributions of processes with eigenvalues
decaying as in Assumption E automatically have bounded effective rank.
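
For the Brownian motion of Remark 4.1 these statements can be checked numerically. The sketch below is an illustration under assumptions that are not part of the text: a midpoint grid $t_j = (j - 1/2)/m$, the helper names, and the use of the well-known eigenvalues $\lambda_k = \{(k - 1/2)\pi\}^{-2}$ of the kernel $\min(s,t)$. It compares the spectrum of the scaled matrix $K$ with the operator eigenvalues, compares $\mathrm{tr}(K)$ with $\Lambda_0 = \int_0^1 t\,dt = 1/2$, and evaluates the effective rank.

```python
import numpy as np


def scaled_kernel_matrix(kernel, t):
    # K = m^{-1} { kernel(t_j1, t_j2) }, the scaled covariance matrix.
    return kernel(t[:, None], t[None, :]) / len(t)


def effective_rank(S):
    # r_e(S) = tr(S) / ||S||_2 for a positive semi-definite matrix S.
    return np.trace(S) / np.linalg.norm(S, 2)


m = 200
t = (np.arange(m) + 0.5) / m                 # midpoint grid: an assumption
K = scaled_kernel_matrix(np.minimum, t)      # Brownian motion kernel min(s, t)
k = np.arange(1, m + 1)
lam = 1.0 / ((k - 0.5) * np.pi) ** 2         # first m operator eigenvalues
lam_tilde = np.linalg.eigvalsh(K)[::-1]      # eigenvalues of K, in decreasing order
sup_error = np.max(np.abs(lam_tilde - lam))  # compared over the first m indices only
trace_error = abs(np.trace(K) - 0.5)
r_e = effective_rank(K)
```

With $\gamma = 2$ and $\gamma_1 = 1$ the rate in (4.2) is $m^{-1/3}$; in this example the observed eigenvalue error sits well below that level, and the effective rank stays close to $0.5/\lambda_1 \approx 1.23$.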



4.2 Detectable jumps in the spectrum of a covariance operator 

The concept of spectral jump introduced in Section 3.1 can be similarly defined for functional
data.

Definition 3 (Spectral jump). Let $\eta > 0$ be given. We say that the covariance operator
$\mathcal{K}(s,t)$ has a spectral jump at index $s$ if
\[ \lambda_s > (1 + \delta)\eta \quad \text{and} \quad \lambda_{s+1} < (1 - \delta)\eta, \tag{4.5} \]
for some $\delta \in (0, 1]$. By slight abuse of notation, we denote an index $s$ that satisfies (4.5)
by $r(\eta)$. The following theorem shows that we can detect spectral jumps via a data driven
thresholding of the spectrum of $\Sigma_n$. Since Proposition 4.1 guarantees that the spectra of $\mathcal{K}$
and $K$ are close, the construction of these thresholds follows immediately from Theorem 3.2.

Theorem 4.1. Suppose that $X(t)$ is a Gaussian process with a covariance function that
satisfies Assumptions A - E. The assumption on $m$ is the same as in Proposition 4.1. Let $\hat\eta_{2,n}$
be given by (3.7). If $\eta = 2\eta_2$ satisfies (4.5) with $\delta = \eta^{-1}\{\eta_2 + C_3 m^{(1-\gamma)/(\gamma+\gamma_1)} + m^{-1}\sigma^2\} + \epsilon_1$,
where $C_3$ is the constant in Proposition 4.1, then
\[ P\{\hat k(2\hat\eta_{2,n}) = r(2\eta_2)\} \ge 1 - 10 n^{-1}. \]

Remark 4.2. We have stated Theorem 4.1 in terms of $\eta_2$ given by (3.5) of Section 3.1
above. From the results of Section 2.2, summarized in Table 1, we recall that this is the
optimal bound on $\|\Sigma_n - \Sigma\|_2$, in the regime $m = O\{\exp(n)\}$. In this bound, the explicit
dependence on $m$ is removed, therefore $m \to \infty$ is allowed. This facilitates
the direct translation of our results to the ideal case of perfectly sampled trajectories, when
$m = \infty$. For each fixed $m$, the bound $\eta_1$ given by (3.4) and its estimator $\hat\eta_{1,n}$ can also be
employed, resulting in a sharper bound, but of the same order as $\eta_2$.

Remark 4.3. For processes with covariance operator satisfying the assumptions stated in
Section 4.1 above, an application of Corollary 4.1 shows that both $\eta_1$ and $\eta_2$ are of order
$O(\sqrt{\ln n/n})$. Therefore, Definition 3 above combined with the expression for $\delta$ given in
Theorem 4.1 shows that, for every fixed $m$, the minimum order of magnitude of a recoverable
eigenvalue is
\[ O\{\sqrt{\ln n/n} + m^{(1-\gamma)/(\gamma+\gamma_1)}\}. \tag{4.6} \]
This allows for a direct comparison between the results of Section 3.2, developed for
covariance matrices of bounded effective rank, and the results developed here, for covariance
operators that have finite dimensional covariance matrices of bounded effective rank. The
difference resides in the existence of the extra additive term $m^{(1-\gamma)/(\gamma+\gamma_1)}$, which quantifies
the approximation error.
4.3 On the accuracy of the selected sample eigenvalues and eigenvectors for functional data

We specialize the results of Section 3.3 to this context. For this, we first establish finite
sample upper bounds for the sample eigenvalues and eigenvectors.

Theorem 4.2. Suppose that $X(t)$ is a Gaussian process with a covariance function that
satisfies Assumptions A - E. The assumption on $m$ is the same as in Proposition 4.1. Let
$C = \max(m^{-1}\sigma^2 + c_2\Lambda_0, C_3)$, where $c_2$ is as in Theorem 2.3 and $C_3$ is as in Proposition 4.1.
Define
\[ \tilde\eta_{\min} =: C\left\{\frac{\sqrt{2\exp(1)} + 8\sqrt{\ln 2n}}{\sqrt{n}} + \frac{\ln n}{n} + m^{\frac{1-\gamma}{\gamma+\gamma_1}}\right\}. \tag{4.7} \]
Then with probability at least $1 - 4n^{-1}$, the following holds for each $k$:
\[ |\hat\lambda_k - \lambda_k| \le \tilde\eta_{\min}. \]
Furthermore, with probability at least $1 - 4n^{-1}$, for each $1 \le k \le m^{1/(\gamma+\gamma_1)}$,
\[ \|\hat\psi_k - m^{-1/2}\psi_k\| \le \frac{\tilde\eta_{\min}}{\min_{\lambda \in EG(\mathcal{K}),\, \lambda \ne \lambda_k} |\lambda - \lambda_k|} + 6\left\{\frac{\tilde\eta_{\min}}{\min_{\lambda \in EG(\mathcal{K}),\, \lambda \ne \lambda_k} |\lambda - \lambda_k|}\right\}^2 + 7 c_0\, m^{\frac{1-\gamma}{\gamma+\gamma_1}}. \tag{4.8} \]

The proof of Theorem 4.2 follows directly from Proposition 4.1 and Theorem 3.3, hence the
details are omitted. Note that we consider $m^{-1/2}\psi_k$ in order to have it on the same scale as
the unit-norm sample eigenvector $\hat\psi_k$.

Remark 4.4. Theorem 4.2 evaluates the accuracy of sample eigenvalues and eigenvectors as
a function of both the sample size and the number of observations per subject. In particular,
for the Brownian motion,
\[ |\hat\lambda_k - \lambda_k| \lesssim \sqrt{\ln n/n} + m^{-1/3}, \quad \text{for each } k, \]
with high probability. Reasoning as in Section 3.3, it also follows that all sample eigenvalues
above $\tilde\eta_{\min}$, or above an estimate of it, will also be close to the corresponding theoretical
values, with high probability.
The accuracy of the sample eigenvectors also depends on how well separated the eigenvalues
of $\mathcal{K}$ are from each other. Reasoning exactly as in Theorem 3.4 of Section 3.3 we obtain the
following result.

Theorem 4.3. Assume the settings in Theorem 4.2 hold. Furthermore, let Assumptions 2
and 3 in Section 3 hold for the eigenvalues of $\mathcal{K}$. Then, with $\tilde\eta_{\min}$ as in (4.7), let
\[ \tilde\eta_v = O\{(\tilde\eta_{\min} \ln n)^{\beta/\beta_1}\}. \tag{4.9} \]
Then $\|\hat\psi_k - m^{-1/2}\psi_k\| = o(1)$, for each $k \le \min\{\hat k(\tilde\eta_v), m^{1/(\gamma+\gamma_1)}\}$, with probability higher than
$1 - C/n$, for some $C > 0$.

Remark 4.5. The proof is immediate, and identical to the one of Theorem 3.4 above. In
light of Theorem 4.1 the result above continues to hold when $\tilde\eta_{\min}$ is replaced by an estimate.
For the Brownian motion $\beta = \gamma = 2$, $\gamma_1 = 1$ and $\beta_1 = 3$, resulting in
\[ \tilde\eta_{\min} = O\{\sqrt{\ln n/n} + m^{-1/3}\} \quad \text{and} \quad \tilde\eta_v = O\{(\tilde\eta_{\min} \ln n)^{2/3}\}. \]
Reasoning as in Section 3.3, we conclude that a thresholding level that is larger than the
minimal $\tilde\eta_{\min}$ guarantees the accuracy of the sample eigenvectors. For the Brownian motion,
the number of accurate sample eigenvectors is always upper-bounded by $m^{1/3}$, but may be
smaller, depending on the relative value of $\hat k(\tilde\eta_v)$.



5 Conclusions 

The following tables summarize our results. Table 2 shows that, when using the scree-plot
type methods, with an appropriate, data driven, threshold level, one can consistently
estimate the maximum number of population eigenvalues above $\|\Sigma_n - \Sigma\|_2$. Since, with the
notation of Section 2.2, $\Sigma_n^* = \Sigma + (\Sigma_n^* - \Sigma)$, we can regard the last term as a zero mean noise
term, and a bound on its operator norm as the noise level in this problem. The results of
Section 2.2 therefore allow us to obtain sharp bounds on the noise level, identify the largest
detectable rank of $\Sigma$ relative to this level and estimate it consistently.

| What | Notation | Where |
|---|---|---|
| Order of smallest eigenvalue of $\Sigma$ detectable via $\Sigma_n$ | $\eta_{\min} \ge \|\Sigma_n - \Sigma\|_2$ | Theorem 3.1 and Corollary 3.1 |
| Largest detectable relative rank | $r(\eta_{\min})$ | Theorem 3.1 and Corollary 3.1 |
| Data driven thresholds for consistent estimation of $r(\eta_{\min})$ | $\hat\eta_{1,n}$ and $\hat\eta_{2,n}$ | Theorem 3.2 |

Table 2: Summary of the properties of the scree-plot method for relative rank estimation for
general $\Sigma$

| What | Order | Threshold level | Where |
|---|---|---|---|
| Eigenvalues: $|\hat\lambda_k - \lambda_k|$, for each $1 \le k \le p$ | $O(\eta_{\min}) = O(\sqrt{\log n/n})$ | Any | Corollary 3.2 |
| Eigenvectors: $\|\hat\psi_k - \psi_k\|$, for each $k \le r(\eta_v) \le r(\eta_{\min})$ | $o(1)$, under further assumptions on the spectrum of $\Sigma$ | $\eta_v \gg \eta_{\min}$, or data driven | Theorem 3.4, Remark 3.5 |

Table 3: Summary of the scree-plot method for eigenvalues' and eigenvectors' estimation
when the population matrix has $r_e(\Sigma) = O(1)$

Table 3 summarizes our results on the quality of the eigenvalues and eigenvectors retained
by a scree-plot method that would be based on the minimal threshold value, just above the
noise level $\eta_{\min}$. We obtained these results for the particular class of population matrices
with $r_e(\Sigma) = O(1)$, motivated in Section 3.2. We showed that whereas all sample eigenvalues
would be accurate estimates for their population counterparts, a higher threshold level is
needed for the convergence of the eigenvectors. We analyzed a similar phenomenon in the
functional data context, for which an analogue of Table 3 holds, with $\eta_{\min}$ replaced by $\tilde\eta_{\min}$,
$\eta_v$ replaced by $\tilde\eta_v$, and all the corresponding results were presented in Section 4.3.
A Technical Proofs

A.1 Technical proofs of Section 2

First we establish two lemmas.

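Lemma A.1. Let $X \in \mathbb{R}^p$ be a generic vector. Let $\mathcal{A} = \{u = (u_1, \ldots, u_p)' \in \mathbb{R}^p : |u_1| =
\cdots = |u_p| = 1\}$. Then for any positive integer $d$,
\[ \|X\|^{2d} \le \frac{1}{2^p} \sum_{u \in \mathcal{A}} (u'X)^{2d}. \]

Proof of Lemma A.1. We write $X = (X_1, \ldots, X_p)'$. We have
\[ \sum_{u \in \mathcal{A}} (u'X)^{2d} = \sum_{d_1 + \cdots + d_p = 2d}\ \sum_{u \in \mathcal{A}} \binom{2d}{d_1, \ldots, d_p} \prod_{j=1}^p (u_j X_j)^{d_j}. \tag{A.1} \]
It can be shown that
\[ \sum_{u \in \mathcal{A}} \left\{ \binom{2d}{d_1, \ldots, d_p} \prod_{j=1}^p (u_j X_j)^{d_j} \right\} = 0 \]
if any of the $d_j$'s is odd. To see this, assume for simplicity that $d_1$ is odd. Then
\[ \prod_{j=1}^p (u_j X_j)^{d_j} + (-u_1 X_1)^{d_1} \prod_{j=2}^p (u_j X_j)^{d_j} = 0 \]
and $(-u_1, u_2, \ldots, u_p)$ is also in $\mathcal{A}$. It follows that equation (A.1) becomes
\begin{align*}
\sum_{u \in \mathcal{A}} (u'X)^{2d} &= \sum_{d_1 + \cdots + d_p = d}\ \sum_{u \in \mathcal{A}} \binom{2d}{2d_1, \ldots, 2d_p} \prod_{j=1}^p (u_j X_j)^{2d_j} \\
&= \sum_{d_1 + \cdots + d_p = d}\ \sum_{u \in \mathcal{A}} \binom{2d}{2d_1, \ldots, 2d_p} \prod_{j=1}^p X_j^{2d_j} \\
&= 2^p \sum_{d_1 + \cdots + d_p = d} \binom{2d}{2d_1, \ldots, 2d_p} \prod_{j=1}^p X_j^{2d_j} \\
&\ge 2^p \sum_{d_1 + \cdots + d_p = d} \binom{d}{d_1, \ldots, d_p} \prod_{j=1}^p X_j^{2d_j} \\
&= 2^p \|X\|^{2d},
\end{align*}
as desired. In the above derivation, we used the inequality
\[ \binom{2d}{2d_1, \ldots, 2d_p} \ge \binom{d}{d_1, \ldots, d_p}, \]
which can be easily verified. □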
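
For small $p$ the inequality of Lemma A.1 can be verified by brute force. The sketch below, with hypothetical helper names, enumerates all $2^p$ sign vectors and compares $\|X\|^{2d}$ with the sign average of $(u'X)^{2d}$; it is a numerical sanity check for particular vectors, not a substitute for the proof.

```python
import itertools
import numpy as np


def sign_average(x, d):
    # 2^{-p} * sum over all sign vectors u in {-1, +1}^p of (u'x)^(2d).
    p = len(x)
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=p):
        total += float(np.dot(signs, x)) ** (2 * d)
    return total / 2 ** p


def check_lemma_a1(x, d):
    # Lemma A.1: ||x||^(2d) <= sign average; a tiny relative slack absorbs rounding.
    lhs = float(np.dot(x, x)) ** d
    return lhs <= sign_average(x, d) * (1.0 + 1e-12)


x = [1.0, 2.0, 3.0]
avg_d1 = sign_average(x, 1)   # equals ||x||^2 = 14, so d = 1 gives equality
avg_d2 = sign_average(x, 2)   # equals 392, while ||x||^4 = 196
```

For $d = 1$ the cross terms cancel and the two sides coincide; for $d = 2$ the sign average is strictly larger, in line with the multinomial comparison used at the end of the proof.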

Remark A.1. In the following proofs we will assume sometimes, without loss of generality,
that $\Sigma$ is a diagonal matrix. This can be immediately justified as follows. Consider the
eigen-decomposition $\Sigma = ODO'$, where $O$ is an orthonormal matrix and $D$ is a diagonal
matrix. Then $\mathrm{cov}(O'X) = D$ and $\|X\| = \|O'X\|$. Similar arguments can be employed
when we consider orthonormal transforms of matrices, and evaluate either their Frobenius
or operator norm.
Lemma A.2. Let $X \in \mathbb{R}^p$ be a zero-mean sub-Gaussian random vector that satisfies
Assumption 1. For any positive integer $d$,
\[ E\|X\|^{2d} \le \frac{(2d)^d}{c_0^d} [\mathrm{tr}(\Sigma)]^d. \]

Proof of Lemma A.2. Since $\|X\|^{2d}$ is invariant under orthonormal transforms, we assume
that $\Sigma$ is a diagonal matrix. By Lemma A.1,
\[ E\|X\|^{2d} \le \frac{1}{2^p} \sum_{u \in \mathcal{A}} E(u'X)^{2d}, \]
where $\mathcal{A} = \{u = (u_1, \ldots, u_p)' \in \mathbb{R}^p : |u_1| = \cdots = |u_p| = 1\}$. By Assumption 1,
\[ E(u'X)^{2d} \le \frac{(2d)^d}{c_0^d} (u'\Sigma u)^d = \frac{(2d)^d}{c_0^d} [\mathrm{tr}(\Sigma)]^d, \]
where the last equality holds since we assume that $\Sigma$ is a diagonal matrix. It follows that
\[ E\|X\|^{2d} \le \frac{(2d)^d}{c_0^d} [\mathrm{tr}(\Sigma)]^d. \]
□

Lemma A.3. Let $X \in \mathbb{R}^p$ be a zero-mean sub-Gaussian random vector that satisfies
Assumption 1. Then
\[ \big\| \|X\| \big\|_{\psi_2}^2 \le \frac{2\,\mathrm{tr}(\Sigma)}{c_0}. \]

Proof of Lemma A.3. Note that $E\|X\|^k \le \sqrt{E\|X\|^{2k}}$ for all $k \ge 1$. Hence by the definition
of the sub-Gaussian norm and Lemma A.2,
\begin{align*}
\big\| \|X\| \big\|_{\psi_2} &= \sup_{k \ge 1} k^{-1/2} (E\|X\|^k)^{1/k} \\
&\le \sup_{k \ge 1} k^{-1/2} (E\|X\|^{2k})^{1/(2k)} \\
&\le \sup_{k \ge 1} k^{-1/2} (2k)^{1/2} \sqrt{\mathrm{tr}(\Sigma)/c_0} \\
&\le \sqrt{2\,\mathrm{tr}(\Sigma)/c_0},
\end{align*}
which is the desired result. □
A.1.1 Proof of the statements in Examples 2.1 and 2.2

Proof of the statements in Examples 2.1 and 2.2. First we show that $X$ in Example 2.1 is
sub-Gaussian and satisfies Assumption 1. Let $u \in \mathbb{R}^p$ be an arbitrary non-random vector.
Then for any $t > 0$,
\[ E\exp(t u'X) = \prod_{j=1}^p E\exp(t u_j X_j) \le \prod_{j=1}^p \exp\left\{ (t u_j \sqrt{\Sigma_{jj}})^2 \sigma^2/2 \right\} = \exp\{t^2 (u'\Sigma u)\sigma^2/2\}, \]
where the last equality holds because $\Sigma$ can be assumed to be a diagonal matrix, without loss
of generality. Hence $u'X$ is sub-Gaussian and $X$ is a sub-Gaussian random vector. The
above inequality also implies
\[ E\exp\left\{ t\, \frac{u'X}{\sqrt{u'\Sigma u}} \right\} \le \exp(t^2\sigma^2/2). \]
By Lemma 5.5 in Vershynin [22], there exists a fixed constant $c_0$ (depending only on $\sigma^2$) such that
$c_0 \|(u'X)/\sqrt{u'\Sigma u}\|_{\psi_2}^2 \le 1$. By the homogeneity of the sub-Gaussian norm, we have
$c_0 \|u'X\|_{\psi_2}^2 \le u'\Sigma u$, as desired.

Next, writing $\tilde X = OX$, since $u'\tilde X = (O'u)'X$, similar to the derivation above,
\[ E\exp(t u'\tilde X) \le \exp\{t^2 (O'u)'\Sigma(O'u)\sigma^2/2\} = \exp\{t^2 (u'O\Sigma O'u)\sigma^2/2\} \]
and $c_0 \|(u'\tilde X)/\sqrt{u'O\Sigma O'u}\|_{\psi_2}^2 \le 1$, therefore $c_0 \|u'\tilde X\|_{\psi_2}^2 \le u'O\Sigma O'u$. It follows that $\tilde X$ is
also sub-Gaussian and satisfies Assumption 1. □



A.1.2 Proof of Proposition 2.1

Proof of Proposition 2.1. Let $\|\cdot\|_{\psi_1}$ be the sub-exponential norm of a sub-exponential random
variable (see Definition 5.13 of Vershynin [22]). We have
\begin{align*}
\big\| \|X\|^2 - \mathrm{tr}(\Sigma) \big\|_{\psi_1} &\le \big\| \|X\|^2 \big\|_{\psi_1} + \|\mathrm{tr}(\Sigma)\|_{\psi_1} \\
&\le 2 \big\| \|X\| \big\|_{\psi_2}^2 + \mathrm{tr}(\Sigma) \\
&\le \mathrm{tr}(\Sigma)(4c_0^{-1} + 1). \tag{A.2}
\end{align*}
In the second inequality above we used Lemma 5.14 of Vershynin [22] and for the third
inequality we used Lemma A.3 above. Since $\|X\|^2 - \mathrm{tr}(\Sigma)$ is a zero-mean sub-exponential
random variable, by Lemma 5.15 of Vershynin [22], there exist two fixed constants $C, c$ such
that if $|t| \ge c\, \big\| \|X\|^2 - \mathrm{tr}(\Sigma) \big\|_{\psi_1}$,
\[ E\exp\left\{ \frac{\|X\|^2 - \mathrm{tr}(\Sigma)}{t} \right\} \le \exp\left\{ C\, \frac{\big\| \|X\|^2 - \mathrm{tr}(\Sigma) \big\|_{\psi_1}^2}{t^2} \right\}. \]
Combining (A.2) with the above inequality, we obtain the proposition. □



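A.1.3 Proof of Theorem 2.1

Proof of Theorem 2.1. It is straightforward to verify that $\sqrt{n}\bar X$ is sub-Gaussian and satisfies
Assumption 1 with the same $c_0$. Applying the Markov inequality to $\exp\{t^{-1} n\|\bar X\|^2\}$ we obtain,
for any $a > 0$, $t \ge c(4c_0^{-1} + 1)\mathrm{tr}(\Sigma)$,
\begin{align*}
P\{n\|\bar X\|^2 - \mathrm{tr}(\Sigma) \ge a\} &\le \exp(-a t^{-1})\, E\exp\{t^{-1}[n\|\bar X\|^2 - \mathrm{tr}(\Sigma)]\} \\
&\le \exp(-a t^{-1}) \exp\left\{ C \left[ \frac{(4c_0^{-1} + 1)\mathrm{tr}(\Sigma)}{t} \right]^2 \right\},
\end{align*}
where the last inequality holds by Proposition 2.1. By letting $t = \max(\sqrt{C}, c)(4c_0^{-1} + 1)\mathrm{tr}(\Sigma)$
and $a = t \ln n$ we obtain from the above inequality that
\[ P\left\{ n\|\bar X\|^2 - \mathrm{tr}(\Sigma) \ge \max(\sqrt{C}, c)(4c_0^{-1} + 1) \ln n \cdot \mathrm{tr}(\Sigma) \right\} \le \exp(1)/n, \]
which is the desired result. □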



A.1.4 Bounds on the Frobenius norm: Proof of Theorem 2.2

The proof of Theorem 2.2 consists in adapting a new powerful result in Juditsky and
Nemirovski [15] to our context and verifying that its hypotheses hold. For completeness, we
state these results below.

Theorem A.1. Let $(E, |||\cdot|||)$ be $\kappa$-smooth with a norm $|||\cdot|||$ on $E$. Let $\{Z_1, Z_2, \ldots\}$
be $E$-valued, zero-mean and independent. Assume that there exists a sequence of positive
numbers $\{\sigma_1, \sigma_2, \ldots\}$ such that
\[ E\{\exp(\sigma_i^{-1} |||Z_i|||)\} \le \exp(1), \quad i = 1, 2, \ldots. \]
Then for all $n \ge 1$ and $t \ge 0$:
\[ P\left\{ |||Z_1 + \cdots + Z_n||| \ge \left(\sqrt{\exp(1)\kappa} + t\right) \sqrt{\sum_{i=1}^n \sigma_i^2} \right\} \le 2\exp\left\{ -\frac{1}{64} \min(t^2, t_n^* t) \right\}, \]
where $t_n^* = 16\sqrt{\sum_{i=1}^n \sigma_i^2} \big/ \max_{1 \le i \le n} \sigma_i$.

Remark A.2. Theorem A.1 is a special case of Theorem 4.1 in Juditsky and Nemirovski [15]
and the definition of a $\kappa$-smooth space is on page 3 therein.

Theorem A.2. Let $2 \le p \le \infty$. The Schatten norm $\|Z\|_p = \{\sum_j [d_j(Z)]^p\}^{1/p}$ on the space
$\mathbb{R}^{m \times n}$ of $m \times n$ real matrices, where $d_1(Z) \ge d_2(Z) \ge \ldots$ are the singular values of $Z$, is
$\kappa_p(m,n)$-smooth with
\[ \kappa_p(m,n) = \min_{2 \le \rho < \infty,\ \rho \le p} \{\max(2, \rho - 1)\} \{\min(m, n)\}^{2/\rho - 2/p}. \]

Remark A.3. Theorem A.2 is Example 3.3 in Juditsky and Nemirovski [15]. For $p = 2$ we
have the Frobenius norm, which is $\kappa$-smooth with $\kappa = 2$.

Proposition A.1. Let $Z = XX' - \Sigma$. Then $E\{\exp[t^{-1}\|Z\|_F]\} \le \exp(1)$, for any $t \ge
2\max(\sqrt{C}, c)(4c_0^{-1} + 1)\mathrm{tr}(\Sigma)$.

Proof of Proposition A.1. First we have $\|Z\|_F = \|XX' - \Sigma\|_F \le \|XX'\|_F + \|\Sigma\|_F =
\|X\|^2 + \|\Sigma\|_F$. It is easy to show that $\|\Sigma\|_F \le \mathrm{tr}(\Sigma)$. Hence
\begin{align*}
E\{\exp[t^{-1}\|Z\|_F]\} &\le \exp\{t^{-1}[\|\Sigma\|_F + \mathrm{tr}(\Sigma)]\}\, E\{\exp[t^{-1}(\|X\|^2 - \mathrm{tr}(\Sigma))]\} \\
&\le \exp\{2t^{-1}\mathrm{tr}(\Sigma)\} \exp\left\{ C \left[ \frac{(4c_0^{-1} + 1)\mathrm{tr}(\Sigma)}{t} \right]^2 \right\} \\
&\le \exp(1),
\end{align*}
as desired if $t \ge 2\max(\sqrt{C}, c)(4c_0^{-1} + 1)\mathrm{tr}(\Sigma)$. In the above derivation we used
Proposition 2.1. □

Proposition A.2. For all $n \ge 1$ and $t \ge 0$:
\[ P\left\{ \|\Sigma_n^* - \Sigma\|_F \ge \left(\sqrt{2\exp(1)} + t\right) \frac{c_1\, \mathrm{tr}(\Sigma)}{\sqrt{n}} \right\} \le 2\exp\left\{ -\frac{1}{64} \min(t^2, 16 t\sqrt{n}) \right\}, \]
where $c_1 = 2\max(\sqrt{C}, c)(4c_0^{-1} + 1)$.

Proof of Proposition A.2. By Theorem A.2 the Frobenius norm is 2-smooth on the space
$\mathbb{R}^{p \times p}$ of $p \times p$ real matrices. Hence the proposition follows directly by Proposition A.1 and
Theorem A.1. □

Proof of Theorem 2.2. The claim follows immediately from Proposition A.2 by choosing
$t = 8\sqrt{\ln 2n}$. □



A.1.5 Bounds on the operator norm: Proof of Theorem 2.4

To derive the set of bounds on $\|\Sigma_n - \Sigma\|_2$ presented in Theorem 2.4 we will appeal to the
following result, which is adapted from Theorem 6.2 in Tropp [21].

Theorem A.3. Let $\{Z_i, i = 1, \ldots, n\}$ be a sequence of independent and identically
distributed symmetric matrices of dimension $p$. Assume that there exist positive quantities $R$
and $\sigma$ such that
\[ E(Z_i) = 0 \quad \text{and} \quad \|E(Z_i^d)\|_2 \le \frac{d!}{2} \cdot R^{d-2}\sigma^2 \quad \text{for } d = 2, 3, \ldots \tag{A.3} \]
Then for all $t \ge 0$, with probability at least $1 - \exp(-t)$,
\[ \left\| \frac{Z_1 + \cdots + Z_n}{n} \right\|_2 \le 3 \cdot \max\left\{ \sigma\sqrt{\frac{t + \ln p}{n}},\ R\, \frac{t + \ln p}{n} \right\}. \]

The proof of Theorem 2.4 consists in the non-trivial verification of condition (A.3). We
do this in the following proposition and two lemmas.



Proposition A.3. Let Assumption 1 hold, and define $Z = XX' - \Sigma$, where $\Sigma$ is the
covariance matrix of $X$. Let $c_1 = \sup_{d \ge 1} \exp(-d) d^d/d!$, $c_2 = c_1 c_0 \exp(-1) + c_1 \exp(-1)/4 + 3$ and
$c_3 = \max\{4\exp(1)/c_0, 1\}$. If we let $R = 2c_3 \cdot \mathrm{tr}(\Sigma)$ and $\sigma^2 = c_2 c_3^2 \cdot \mathrm{tr}(\Sigma) \cdot \|\Sigma\|_2$, then
\[ \|E(Z^d)\|_2 \le \frac{d!}{2} R^{d-2}\sigma^2 \quad \text{for } d = 2, 3, \ldots \]

Proof of Proposition A.3. Let $u \in \mathbb{R}^p$ be a unit vector. First, by Lemma A.5 below,
\[ u'Z^d u \le \|Z\|_2^{d-1} \{u'(XX' + 2\|\Sigma\|_2 \cdot I_p)u\} = \|Z\|_2^{d-1} \{(u'X)^2 + 2\|\Sigma\|_2\}. \tag{A.4} \]
Next, since $\|Z\|_2 \le \|XX'\|_2 + \|\Sigma\|_2 \le \|X\|^2 + \|\Sigma\|_2$, we derive that
\[ \|Z\|_2^{d-1} \le (\|X\|^2 + \|\Sigma\|_2)^{d-1} \le 2^{d-2} (\|X\|^{2(d-1)} + \|\Sigma\|_2^{d-1}). \tag{A.5} \]
Equations (A.4) and (A.5) together imply
\[ u'Z^d u \le 2^{d-2} (\|X\|^{2(d-1)} + \|\Sigma\|_2^{d-1}) \{(u'X)^2 + 2\|\Sigma\|_2\}. \]
Hence
\[ E(u'Z^d u) \le 2^{d-2} E\{\|X\|^{2(d-1)}(u'X)^2 + 2\|X\|^{2(d-1)}\|\Sigma\|_2 + \|\Sigma\|_2^{d-1}(u'X)^2 + 2\|\Sigma\|_2^d\}. \tag{A.6} \]
By Assumption 1, $\|\Sigma\|_2 \ge c_0 (E|u'X|^{2d})^{1/d}/(2d)$ for any positive integer $d$, i.e., $E|u'X|^{2d} \le
c_0^{-d}(2d)^d \|\Sigma\|_2^d$. Then
\begin{align*}
E\{\|X\|^{2(d-1)}(u'X)^2\} &\le \sqrt{E\|X\|^{4(d-1)}} \cdot \sqrt{E(u'X)^4} \\
&\le 4c_0^{-1} \|\Sigma\|_2 \sqrt{E\|X\|^{4(d-1)}}. \tag{A.7}
\end{align*}
By Lemma A.2 above, we further derive from (A.6) and (A.7) that
\begin{align*}
E(u'Z^d u) &\le 2^{d-2} \{4c_0^{-1}\|\Sigma\|_2 (4d-4)^{d-1} \mathrm{tr}(\Sigma)^{d-1}/c_0^{d-1} + 2(2d-2)^{d-1} \mathrm{tr}(\Sigma)^{d-1}\|\Sigma\|_2/c_0^{d-1} + 3\|\Sigma\|_2^d\} \\
&\le 2^{d-2} \{\|\Sigma\|_2 \mathrm{tr}(\Sigma)^{d-1}\} \{4c_0^{-1}(4d-4)^{d-1}/c_0^{d-1} + 2(2d-2)^{d-1}/c_0^{d-1} + 3\} \\
&\le 2^{d-2} \{\|\Sigma\|_2 \mathrm{tr}(\Sigma)^{d-1}\} \max\{4\exp(1)/c_0, 1\}^d\, d!\, \{c_1 c_0 \exp(-1)/d + c_1 2^{-d}\exp(-1)/d + 3/d!\} \\
&\le 2^{d-2}\, d!\, \max\{4\exp(1)/c_0, 1\}^d \{\|\Sigma\|_2 \mathrm{tr}(\Sigma)^{d-1}\} \{c_1 c_0 \exp(-1)/2 + c_1 \exp(-1)/8 + 3/2\},
\end{align*}
where $c_1 = \sup_{d \ge 1} \exp(-d) d^d/d!$. Because the above inequality holds for any unit vector $u$,
\[ \|E(Z^d)\|_2 \le c_2\, 2^{d-3}\, d!\, \max\{4\exp(1)/c_0, 1\}^d \{\|\Sigma\|_2 \mathrm{tr}(\Sigma)^{d-1}\}, \]
where $c_2 = c_1 c_0 \exp(-1) + c_1 \exp(-1)/4 + 3$. The proof is complete. □

Lemma A.4. Suppose $A, B \in \mathbb{R}^{p \times p}$ are two positive semi-definite matrices. Let $ODO'$ be
an eigendecomposition of $A - B$ with $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$. Let $D_+ = \mathrm{diag}(|\lambda_1|, \ldots, |\lambda_p|)$.
Then $OD_+O' \preceq A + 2\|B\|_2 \cdot I_p$, where the notation "$\preceq$" is used to compare two matrices:
for two matrices $E_1$ and $E_2$, $E_1 \preceq E_2$ means that $E_2 - E_1$ is positive semi-definite.

Proof of Lemma A.4. Let $u_k$ be the $k$-th column of $O$, then $\lambda_k = u_k'(A - B)u_k \ge -\|B\|_2$.
This implies that if $\lambda_k$ is negative, $|\lambda_k| \le \|B\|_2$. Hence $|\lambda_k| - \lambda_k \le 2\|B\|_2$ and $D_+ \preceq D + 2\|B\|_2 \cdot I_p$.
It follows that $OD_+O' \preceq O(D + 2\|B\|_2 \cdot I_p)O' = A - B + 2\|B\|_2 \cdot I_p \preceq A + 2\|B\|_2 \cdot I_p$. □
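
The matrix inequality of Lemma A.4 can be checked numerically on a random instance. In the sketch below, whose helper name and random test matrices are assumptions for illustration only, the smallest eigenvalue of $A + 2\|B\|_2 I_p - OD_+O'$ is computed; the lemma says it is non-negative.

```python
import numpy as np


def abs_part_gap(A, B):
    # Eigendecomposition A - B = O D O' of the symmetric difference.
    lam, O = np.linalg.eigh(A - B)
    # O D_+ O', with D_+ holding the absolute values of the eigenvalues.
    abs_part = (O * np.abs(lam)) @ O.T
    bound = A + 2.0 * np.linalg.norm(B, 2) * np.eye(A.shape[0])
    # Smallest eigenvalue of (A + 2||B||_2 I) - O D_+ O'.
    return np.linalg.eigvalsh(bound - abs_part).min()


rng = np.random.default_rng(1)
G = rng.normal(size=(5, 5))
H = rng.normal(size=(5, 5))
A = G @ G.T      # random positive semi-definite matrix
B = H @ H.T      # random positive semi-definite matrix
gap = abs_part_gap(A, B)
```

A non-negative `gap`, up to rounding error, is what the lemma predicts for any pair of positive semi-definite matrices.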

Lemma A.5. Suppose $A, B \in \mathbb{R}^{p \times p}$ are two positive semi-definite matrices. Fix $u \in \mathbb{R}^p$.
For an arbitrary positive integer $d$,
\[ u'(A - B)^d u \le \|A - B\|_2^{d-1} \{u'(A + 2\|B\|_2 \cdot I_p)u\}. \]

Proof of Lemma A.5. Let $ODO'$ be an eigendecomposition of $A - B$ with $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$
and define $D_+ = \mathrm{diag}(|\lambda_1|, \ldots, |\lambda_p|)$. Let $\tilde u = O'u$. Then $u'(A - B)^d u = (O'u)'D^d(O'u) =
\tilde u'D^d\tilde u = \sum_{i=1}^p \lambda_i^d \tilde u_i^2 \le \max_i |\lambda_i|^{d-1} \cdot \sum_{i=1}^p |\lambda_i| \tilde u_i^2 = \|A - B\|_2^{d-1}(u'OD_+O'u)$, i.e.,
$u'(A - B)^d u \le \|A - B\|_2^{d-1}(u'OD_+O'u)$. By Lemma A.4, $OD_+O' \preceq A + 2\|B\|_2 \cdot I_p$ and the proof
is complete. □

Proof of Theorem 2.4. Let $Z_i = X_iX_i' - \Sigma$, then $E(Z_i) = 0$. We derive that $\Sigma_n^* =
n^{-1}\sum_{i=1}^n X_iX_i' = n^{-1}\sum_{i=1}^n Z_i + \Sigma$ and hence $\|\Sigma_n^* - \Sigma\|_2 = \|n^{-1}\sum_{i=1}^n Z_i\|_2$. With Proposition
A.3, the proof is then complete by applying Theorem A.3 and by taking $t = \ln n$. □



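A.1.6 Proof of Theorem 2.6

Proof of Theorem 2.6. Observe that $\mathrm{tr}(\Sigma_n) = \mathrm{tr}(\Sigma_n^*) - \|\bar X\|^2$. By Theorem 2.1, with
probability at most $3n^{-1}$, $\|\bar X\|^2 \ge c_1 \ln n/n \cdot \mathrm{tr}(\Sigma)$. Therefore, it remains to show that
\[ P\left\{ |\mathrm{tr}(\Sigma_n^*) - \mathrm{tr}(\Sigma)| \ge c_2\sqrt{\ln n/n} \cdot \mathrm{tr}(\Sigma) \right\} \le 6n^{-1}. \]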



By the Markov inequality, if nt > C2tr(E), 

P{tr(E;) - tr(E) >a}< exp(-at _1 )Eexp {t [tr(£;) - tr(E)]} 

< exp(-at~ 1 ) {EexpjrT 1 ^ 1 [||X|| 2 - tr(E)] } }' 

■(4co 1 + l)tr(E)^ 2 



< exp(— at x ) exp < C 



s/nt 



where in the last inequality we used Proposition ETJ By letting t = C2\fntr(Jl) and a = thin 
we obtain from the above inequality that 

P |tr(£*) - tr(E) > c 2 ^lnn/n ■ tr(E)| < 3n _1 . 
With a similar argument we can obtain 

P {tr(E;) - tr(E) < -c 2 ^\nn/n- tr(E)} < 3n _1 
which completes the proof. □ 
A.2 Technical proofs of Section 3

A.2.1 Proof of Theorem 3.1

Proof of Theorem 3.1 and Corollary 3.1. To simplify notation, let $r(\eta) = s$ and $\hat k = \hat k(\eta)$. We have
\[ P(\hat k = s) \ge 1 - P(\hat k < s) - P(\hat k > s). \]
First assume $\hat k < s$. Then $\lambda_{\hat k+1} \ge \lambda_s$. By the definition of $\hat k$ in (3.1), $\hat\lambda_{\hat k} \ge \eta$ and $\hat\lambda_{\hat k+1} < \eta$.
Hence $\lambda_{\hat k+1} - \hat\lambda_{\hat k+1} > \lambda_s - \eta \ge \delta\eta$. Similarly we derive that if $\hat k > s$, then $\hat\lambda_{\hat k} - \lambda_{\hat k} \ge \eta - \lambda_{s+1} > \delta\eta$.
By Weyl's theorem, $|\hat\lambda_k - \lambda_k| \le \|\Sigma_n - \Sigma\|_2$ for each $k$. Hence
\[ P\{\hat k(\eta) = r(\eta)\} \ge 1 - P(\|\Sigma_n - \Sigma\|_2 \ge \delta\eta), \]
which finishes the proof of the theorem. The proof of Corollary 3.1 is a straightforward
combination of Theorems 2.3, 2.5 and 3.1. □



A.2.2 Proof of Theorem 3.2

Proof of Theorem 3.2. We shall only prove the first part as the proof of the second part can
be obtained in an almost identical manner. To simplify notation we let $\hat k(2\hat\eta_{1,n}) = \hat k_1$ and
$r(2\eta_1) = s_1$. By employing arguments similar to those used in the proof of Theorem 3.1 we
obtain
\[ P(\hat k_1 \ne s_1) \le P\left( \|\Sigma_n - \Sigma\|_2 \ge \min\{2\hat\eta_{1,n} - 2(1 - \delta)\eta_1,\ (1 + \delta)2\eta_1 - 2\hat\eta_{1,n}\} \right). \tag{A.8} \]
Define the event
\[ A = \{(1 - \epsilon_1)\eta_1 \le \hat\eta_{1,n} \le (1 + \epsilon_1)\eta_1\}, \]
where $\epsilon_1 = c_1 \ln n/n + c_2\sqrt{\ln n/n}$. Then $A = \{|\mathrm{tr}(\Sigma_n) - \mathrm{tr}(\Sigma)| \le \epsilon_1 \mathrm{tr}(\Sigma)\}$ and hence by
Theorem 2.6, $P(A^c) \le 9n^{-1}$. Therefore, with (A.8),
\begin{align*}
P(\hat k_1 \ne s_1) &\le P(\|\Sigma_n - \Sigma\|_2 \ge 2(\delta - \epsilon_1)\eta_1) + P(A^c) \\
&\le P(\|\Sigma_n - \Sigma\|_2 \ge \eta_1) + P(A^c). \tag{A.9}
\end{align*}
Next define the event
\[ B = \{\|\Sigma_n - \Sigma\|_2 \le \epsilon_2 \|\Sigma\|_2\}. \]
By Theorem 2.5 and the assumptions therein, $P(B^c) \le 4n^{-1}$. It is easy to show that we can
further write (A.9) as
\[ P(\hat k_1 \ne s_1) \le P(\|\Sigma_n - \Sigma\|_2 \ge \eta_1) + P(A^c) + P(B^c) \le 17n^{-1}. \]
Looking into the proofs of the theorems in Section 2, we can actually reduce the number 17
to 11, as we added the probability term in Theorem 2.1 three times. □



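A.3 Technical proofs of Section 4

A.3.1 Proof of Proposition 4.1

Proof of Proposition 4.1. First notice that $\Lambda_0$ is the integral $\int_0^1 \mathcal{K}(t,t)\,dt$, while $\mathrm{tr}(K) =
m^{-1}\sum_{j=1}^m \mathcal{K}(t_j, t_j)$ is a finite approximation to the integral. Hence inequality (4.3) can be
easily proved because of Assumption C.

To prove (4.2) and (4.4), we need some initial derivations. For the fixed design points
$\{t_j : 1 \le j \le m\}$, denote by $M > 0$ a constant such that $M^{-1}m^{-1} \le \min_{0 \le j \le m} |t_{j+1} - t_j| \le
\max_{0 \le j \le m} |t_{j+1} - t_j| \le Mm^{-1}$. Here $t_0 = 0$, $t_{m+1} = 1$. By Assumptions D and E, we have
\[ |\psi_{k_1}'\psi_{k_2}/m - \delta_{k_1,k_2}| \le c_0 \max(k_1, k_2)^{\gamma_1}/m \tag{A.10} \]
for all $k_1$ and $k_2$. Here $c_0$ is a fixed constant and $\delta_{k_1,k_2}$ equals 1 if $k_1 = k_2$ and 0 otherwise.
Let $\lceil x \rceil$ be the smallest integer that is no smaller than $x$. Define $N = \lceil m^{1/(\gamma+\gamma_1)} \rceil \le m$. Let
$A = [\psi_1, \ldots, \psi_N]$ be an $m \times N$ matrix and let $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$. It follows that
\[ K = m^{-1}\sum_{k} \lambda_k \psi_k\psi_k' = m^{-1}ADA' + m^{-1}\sum_{k > N} \lambda_k \psi_k\psi_k', \]
hence
\[ \|K - m^{-1}ADA'\|_F = m^{-1}\Big\|\sum_{k > N} \lambda_k \psi_k\psi_k'\Big\|_F \le m^{-1}\sum_{k > N} \lambda_k \|\psi_k\psi_k'\|_F = m^{-1}\sum_{k > N} \lambda_k \|\psi_k\|^2. \]
By Assumption E, $\lambda_k \le C_2 k^{-\gamma}$. Hence
\[ \sum_{k > N} \lambda_k \le \int_N^{\infty} C_2 x^{-\gamma}\,dx = \frac{C_2 N^{1-\gamma}}{\gamma - 1}. \]
Combining the results above with (A.10), we obtain
\[ \|K - m^{-1}ADA'\|_F \le C_0^2 \sum_{k > N} \lambda_k \le \frac{C_0^2 C_2 N^{1-\gamma}}{\gamma - 1}, \tag{A.11} \]
where $C_0$ is an upper bound for all $\psi_k$ (see Assumption B). Next we study the term $m^{-1}ADA'$.
Consider a QR decomposition of $A$, where $Q$ is an $m \times N$ matrix with orthonormal columns
and $R$ is an $N \times N$ upper-triangular matrix. Then $ADA' = Q(RDR')Q'$. Let $Q$ and $R$ be
given as in Lemma A.6 below. We can further derive for all $1 \le i, k \le N$,
\[ \left| \frac{1}{m}R_{ik}^2 - \delta_{i,k}(1 + r_i)^2 \right| \le \frac{5c_0 k^{\gamma_1}}{m} \le \frac{5c_0 N^{\gamma_1}}{m}, \]
and for all $1 \le i, k, j \le N$ with $i \ne j$,
\[ \left| \frac{1}{m}R_{ik}R_{jk} \right| \le \frac{5c_0 k^{\gamma_1}}{m} \le \frac{5c_0 N^{\gamma_1}}{m}. \]
We let $\tilde D = RDR'$ and compute $\tilde d_{ij}$ below. First
\[ \tilde d_{ii} = \sum_{i \le k \le N} \lambda_k R_{ik}^2 = \sum_{i \le k \le N} \lambda_k m\{R_{ik}^2/m - \delta_{i,k}(1 + r_i)^2\} + \sum_{i \le k \le N} \lambda_k m\,\delta_{i,k}(1 + r_i)^2, \]
and hence
\[ |\tilde d_{ii} - m\lambda_i(1 + r_i)^2| \le \sum_{i \le k \le N} \lambda_k m\, \frac{5c_0 N^{\gamma_1}}{m} \le 5c_0\Lambda_0 N^{\gamma_1}. \]
Furthermore,
\[ (\tilde d_{ii}/m - \lambda_i)^2 \le (\tilde d_{ii}/m - \lambda_i - 2\lambda_i r_i - \lambda_i r_i^2)^2 + (2\lambda_i r_i + \lambda_i r_i^2)^2 \le 25\Lambda_0^2 c_0^2 N^{2\gamma_1}/m^2 + 144\lambda_i^2 c_0^2 N^{2+2\gamma_1}/m^2. \]
Next for $i \ne j$,
\[ |\tilde d_{ij}| = \Big|\sum_{k} \lambda_k R_{ik}R_{jk}\Big| \le 5\Lambda_0 c_0 N^{\gamma_1}. \]
It follows that
\begin{align*}
\|m^{-1}\tilde D - D\|_F^2 &= \sum_{i} (\tilde d_{ii}/m - \lambda_i)^2 + \sum_{i \ne j} (\tilde d_{ij}/m)^2 \\
&\le m^{-2}\sum_{i=1}^N \{25\Lambda_0^2 c_0^2 + 144\lambda_i^2 c_0^2 N^2\} N^{2\gamma_1} + m^{-2}\sum_{i \ne j} 25\Lambda_0^2 c_0^2 N^{2\gamma_1} \\
&\le 169 c_0^2\Lambda_0^2 N^{2+2\gamma_1}/m^2,
\end{align*}
and hence
\[ \|m^{-1}ADA' - QDQ'\|_F = \|m^{-1}\tilde D - D\|_F \le \frac{13c_0\Lambda_0 N^{1+\gamma_1}}{m}. \tag{A.12} \]
Inequalities (A.11) and (A.12) together lead to
\[ \|K - QDQ'\|_F \le \frac{C_0^2 C_2 N^{1-\gamma}}{\gamma - 1} + \frac{13c_0\Lambda_0 N^{1+\gamma_1}}{m}. \tag{A.13} \]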



Now we are ready to prove ( 14. 2ft and ( 14.4ft . First we invoke Weyl's Theorem (Horn and 
Johnson [12], page 181]), to obtain, for each k, 



At — Ai 



< \d k (K) - d k {QDQ')\ + l {k>N} X k 



< ||K — QDQ'\\ 2 + l{fc>Ar}A fc 
C 2 ^ 1 " 7 13c A A^ 1+71 



< 



7-1 

< C 3 m (1 ^ )/(7+7l) , 



??? 



+ C 2 N~ 



(A.14) 



where C 3 = CqC 2 / (7 — 1) + C 2 + 13c A is a fixed constant and recall that = [m 1 ^ 7+71 )] . 
Since the upper bound in the above derivation does not depend on k, we obtain (14. 2p . 



Finally we prove (4.4). As in Lemma A.6 below, we denote the columns of $Q$ by $v_1, \ldots, v_N$.
Then for $1 \le k \le N$, $\psi_k = \sum_{j=1}^{k} R_{jk} v_j$. It follows that

$$\|m^{-1/2}\psi_k - v_k\| \le \sum_{j=1}^{k} |m^{-1/2}R_{jk} - \delta_{j,k}| \le |r_k| + \sum_{j=1}^{k} 3c_0 k^{\gamma_1}/m \le 7c_0 k^{1+\gamma_1}/m \le 7c_0 N^{1+\gamma_1}/m. \tag{A.15}$$
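Next, by Lemma A.1 in Kneip and Utikal [17] (see also inequality (A.6) of Kneip and Sarda [16]), we obtain from (A.13) that the Euclidean distance between the $k$th eigenvector of $K$ and $v_k$ is at most

$$\frac{C_3 m^{(1-\gamma)/(\gamma+\gamma_1)}}{\min_{\lambda \in EG(\mathcal{K}),\, \lambda \ne \lambda_k} |\lambda - \lambda_k|} + \frac{6 \{C_3 m^{(1-\gamma)/(\gamma+\gamma_1)}\}^2}{\{\min_{\lambda \in EG(\mathcal{K}),\, \lambda \ne \lambda_k} |\lambda - \lambda_k|\}^2}. \tag{A.16}$$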



Inequalities (A.15) and (A.16) together give (4.4), which completes the proof of this proposition. □
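The eigenvalue step of the proof above rests on Weyl's inequality: for symmetric matrices, each ordered eigenvalue moves by at most the operator norm of the perturbation. The following is a minimal numerical sketch of that fact for 2 x 2 symmetric matrices, where eigenvalues have a closed form. The helper names `eig_sym2` and `weyl_gap` and the numbers used are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math


def eig_sym2(a, b, c):
    """Eigenvalues (in descending order) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    return (mean + rad, mean - rad)


def weyl_gap(base, pert):
    """Return (max_k |d_k(base + pert) - d_k(base)|, operator norm of pert).

    Each matrix is passed as a triple (a, b, c) standing for [[a, b], [b, c]].
    """
    total = tuple(x + y for x, y in zip(base, pert))
    d_base = eig_sym2(*base)
    d_total = eig_sym2(*total)
    shift = max(abs(p - q) for p, q in zip(d_total, d_base))
    # For a symmetric matrix the operator norm is the largest absolute eigenvalue.
    op_norm = max(abs(e) for e in eig_sym2(*pert))
    return shift, op_norm


# Hypothetical base matrix and perturbation (illustrative values only).
shift, op_norm = weyl_gap(base=(2.0, 0.3, 1.0), pert=(0.05, -0.02, 0.01))
print("largest eigenvalue shift:", shift)
print("operator norm of perturbation:", op_norm)
print("Weyl bound holds:", shift <= op_norm + 1e-12)
```

In the proof, the role of `base` is played by $QDQ'$ and the role of `pert` by $K - QDQ'$.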

Lemma A.6. Suppose the assumptions in Proposition 4.1 hold. Let $A = [\psi_1, \ldots, \psi_N]$ be
an $m \times N$ matrix where $\psi_k = (\psi_k(t_1), \ldots, \psi_k(t_m))'$. Let $(Q, R)$ be a QR decomposition of $A$
where $Q$ is an $m \times N$ matrix with orthonormal columns and $R$ is an $N \times N$ upper-triangular
matrix. Denote the $(k, j)$th element of $R$ by $R_{kj}$. Let $N$ be a positive integer such that
$12 c_0 N^{1+\gamma_1} \le m$, where $c_0$ is the constant in inequality (A.10). If $A$ has full rank, then
there exists a pair $Q$ and $R$ such that if $k > j$, $R_{kj} = 0$, and if $k \le j$,

$$\left| \frac{1}{\sqrt{m}} R_{kj} - \delta_{k,j} - \delta_{k,j} r_k \right| \le 3c_0 j^{\gamma_1}/m,$$

where $r_k$ is defined in such a way that for all $k \le N$,

$$|r_k| \le 4c_0 k^{1+\gamma_1}/m.$$



Proof of Lemma A.6. We construct $Q$ and $R$ by the Gram-Schmidt process. Let $u_1 = \psi_1$, $v_1 = u_1/\|u_1\|$. For $k = 2, \ldots, N$, define $u_k = \psi_k - \sum_{j=1}^{k-1} (\psi_k' v_j) v_j$, $v_k = u_k/\|u_k\|$ and
$r_k = \sqrt{m}/\|u_k\| - 1$. We let $Q = [v_1, \ldots, v_N]$ and $R = Q'A$. Denote by $R_{kj}$ the $(k, j)$th element
of $R$. Then $R_{kj} = \psi_j' v_k$. Note that $\{v_1, \ldots, v_N\}$ are orthonormal vectors and $\psi_k$ can be
written as a linear combination of $\{v_1, \ldots, v_k\}$. Hence for $k > j$, $R_{kj} = 0$ and for $k \le j$,
$R_{kj} = (\psi_j' u_k)/\|u_k\|$. Because

$$\left| \frac{1}{\sqrt{m}} R_{kj} - \delta_{k,j} - \delta_{k,j} r_k \right| = \frac{|m^{-1}\psi_j' u_k - \delta_{k,j}|}{\|u_k\|/\sqrt{m}},$$

the lemma is proved if we can show that for all $k \le N$

$$\big| \|u_k\|/\sqrt{m} - 1 \big| \le 3c_0 k^{1+\gamma_1}/m \tag{A.17}$$

and

$$|\psi_j' u_k - m\delta_{j,k}| \le 2c_0 j^{\gamma_1} \quad \text{for all } j \text{ with } k \le j \le N, \tag{A.18}$$

where $c_0$ is the fixed constant in (A.10). In particular, the inequality $|r_k| \le 4c_0 k^{1+\gamma_1}/m$ can
be proved by (A.17).



We prove inequalities (A.17) and (A.18) by induction on $k$. For $k = 1$, $u_1 = \psi_1$.
Hence $\|u_1\| = \|\psi_1\|$ and (A.17) holds for $k = 1$ by inequality (A.10). Inequality (A.18) for
$k = 1$ can also be proved by (A.10). Assume (A.17) and (A.18) hold for all $k \le N_0 - 1$ with
$N_0 \le N$; then we need to prove that

$$\big| \|u_{N_0}\|/\sqrt{m} - 1 \big| \le 3c_0 N_0^{1+\gamma_1}/m \tag{A.19}$$

and

$$|\psi_j' u_{N_0} - m\delta_{j,N_0}| \le 2c_0 j^{\gamma_1} \quad \text{for all } j \text{ with } N_0 \le j \le N. \tag{A.20}$$

We first prove (A.19). By definition of $u_{N_0}$,

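$$\big| \|u_{N_0}\| - \|\psi_{N_0}\| \big| \le \|u_{N_0} - \psi_{N_0}\| \le \sum_{k=1}^{N_0-1} |\psi_{N_0}' v_k| = \sum_{k=1}^{N_0-1} \frac{|\psi_{N_0}' u_k|}{\|u_k\|} \le \frac{2c_0 N_0^{1+\gamma_1}}{\sqrt{m}\,(1 - 3c_0 N_0^{1+\gamma_1}/m)}$$

by the induction assumption. Since $\big| \|\psi_{N_0}\|^2 - m \big| \le c_0 N_0^{\gamma_1}$ by inequality (A.10), we can
derive $\big| \|\psi_{N_0}\| - \sqrt{m} \big| \le c_0 N_0^{\gamma_1}/(2\sqrt{m} - c_0 N_0^{\gamma_1})$ and hence

$$\big| \|u_{N_0}\| - \sqrt{m} \big| \le \big| \|u_{N_0}\| - \|\psi_{N_0}\| \big| + \big| \|\psi_{N_0}\| - \sqrt{m} \big| \le \frac{2c_0 N_0^{1+\gamma_1}}{\sqrt{m}\,(1 - 3c_0 N_0^{1+\gamma_1}/m)} + \frac{c_0 N_0^{\gamma_1}}{2\sqrt{m} - c_0 N_0^{\gamma_1}}.$$

It follows that

$$\big| \|u_{N_0}\|/\sqrt{m} - 1 \big| \le \frac{2c_0 N_0^{1+\gamma_1}}{m\,(1 - 3c_0 N_0^{1+\gamma_1}/m)} + \frac{c_0 N_0^{\gamma_1}}{\sqrt{m}\,(2\sqrt{m} - c_0 N_0^{\gamma_1})} \le 3c_0 N_0^{1+\gamma_1}/m$$

by the assumption that $12 c_0 N^{1+\gamma_1}/m \le 1$. So we have proved (A.19).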



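We now prove (A.20). Note that for $j \ge N_0$,

$$\psi_j' u_{N_0} = \psi_j' \psi_{N_0} - \sum_{k=1}^{N_0-1} (\psi_{N_0}' v_k)(\psi_j' v_k).$$

By the induction assumption, for $k \le N_0 - 1$, $|\psi_{N_0}' v_k| = |\psi_{N_0}' u_k|/\|u_k\| \le 3c_0 N_0^{\gamma_1}/\sqrt{m}$
by the assumption on $N$. Similarly, $|\psi_j' v_k| \le 3c_0 j^{\gamma_1}/\sqrt{m}$. Hence $|\psi_j' u_{N_0} - \psi_j' \psi_{N_0}| \le 9c_0^2 N_0^{1+\gamma_1} j^{\gamma_1}/m \le \tfrac{3}{4} c_0 j^{\gamma_1}$ and

$$|\psi_j' u_{N_0} - m\delta_{j,N_0}| \le |\psi_j' u_{N_0} - \psi_j' \psi_{N_0}| + |\psi_j' \psi_{N_0} - m\delta_{j,N_0}| \le \tfrac{3}{4} c_0 j^{\gamma_1} + c_0 j^{\gamma_1} \le 2c_0 j^{\gamma_1},$$

which proves (A.20) and also the lemma. □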
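The Gram-Schmidt construction used in this proof can be tried out numerically. The sketch below is only an illustration under stated assumptions: the cosine basis $\psi_k(t) = \sqrt{2}\cos(k\pi t)$, the grid $t_j = j/m$, and all function names are hypothetical choices, not taken from the paper. It builds $u_k$, $v_k$ and $R = Q'A$ as in the proof, then reports how close $R$ is to upper-triangular, how close $\|u_k\|/\sqrt{m}$ is to 1, and how small the scaled off-diagonal entries $R_{kj}/\sqrt{m}$ are for $k < j$.

```python
import math


def discretized_basis(m, n_funcs):
    """Vectors psi_k = (psi_k(t_1), ..., psi_k(t_m)) for psi_k(t) = sqrt(2) cos(k pi t), t_j = j/m.

    The cosine basis and the grid are illustrative assumptions.
    """
    return [[math.sqrt(2.0) * math.cos(k * math.pi * j / m) for j in range(1, m + 1)]
            for k in range(1, n_funcs + 1)]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def gram_schmidt(cols):
    """Classical Gram-Schmidt: u_k = psi_k - sum_{j<k} (psi_k' v_j) v_j, v_k = u_k / ||u_k||."""
    u, v = [], []
    for psi in cols:
        coeffs = [dot(psi, vj) for vj in v]  # psi_k' v_j for j < k
        uk = list(psi)
        for cj, vj in zip(coeffs, v):
            uk = [a - cj * b for a, b in zip(uk, vj)]
        norm = math.sqrt(dot(uk, uk))
        u.append(uk)
        v.append([a / norm for a in uk])
    return u, v


def qr_summary(m, n_funcs):
    """Return (max |R_kj| for k > j, max | ||u_k||/sqrt(m) - 1 |, max |R_kj|/sqrt(m) for k < j)."""
    cols = discretized_basis(m, n_funcs)
    u, v = gram_schmidt(cols)
    # R = Q'A, so R[k][j] = psi_j' v_k (0-based indices here).
    R = [[dot(cols[j], v[k]) for j in range(n_funcs)] for k in range(n_funcs)]
    lower = max((abs(R[k][j]) for k in range(n_funcs) for j in range(k)), default=0.0)
    norm_dev = max(abs(math.sqrt(dot(uk, uk)) / math.sqrt(m) - 1.0) for uk in u)
    offdiag = max((abs(R[k][j]) / math.sqrt(m)
                   for k in range(n_funcs) for j in range(k + 1, n_funcs)), default=0.0)
    return lower, norm_dev, offdiag


lower, norm_dev, offdiag = qr_summary(m=200, n_funcs=5)
print("below-diagonal entries of R (should be numerically zero):", lower)
print("max deviation of ||u_k||/sqrt(m) from 1:", norm_dev)
print("max scaled above-diagonal entry of R:", offdiag)
```

Increasing `m` with `n_funcs` fixed should shrink the last two quantities, in line with the $1/m$ rates in (A.17) and in the lemma.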

A.3.2 Proof of Theorem 4.1

Proof of Theorem 4.1. Similarly to the proof of Theorem 3.1, we can derive that

$$\mathbb{P}\{\hat k(2\eta_{2,n}) \ne r_e(2\eta_2)\} \le \mathbb{P}\big( \|\Sigma_n - \Sigma\|_2 \ge 2\min\{\eta_{2,n} - d_{s+1}(\Sigma),\ d_s(\Sigma) - \eta_{2,n}\} \big).$$

Next we use the same argument as in the proof of Theorem 3.2. Define the event

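$$\mathcal{A} = \{(1 - \epsilon_1)\eta_2 \le \eta_{2,n} \le (1 + \epsilon_1)\eta_2\},$$

where $\epsilon_1 = c_1 \ln n/n + c_2 \sqrt{\ln n/n}$. Then $\mathcal{A} = \{|\mathrm{tr}(\Sigma_n) - \mathrm{tr}(\Sigma)| \le \epsilon_1 \mathrm{tr}(\Sigma)\}$ and hence by
Theorem EH $\mathbb{P}(\mathcal{A}^c) \le 9n^{-1}$. It follows that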

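$$\mathbb{P}\{\hat k(2\eta_{2,n}) \ne r_e(2\eta_2)\} \le \mathbb{P}\big( \|\Sigma_n - \Sigma\|_2 \ge 2\min\{(1 - \epsilon_1)\eta_2 - d_{s+1}(\Sigma),\ d_s(\Sigma) - (1 + \epsilon_1)\eta_2\} \big) + \mathbb{P}(\mathcal{A}^c). \tag{A.21}$$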

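Note that $d_s(\Sigma) = \tilde\lambda_s + m^{-1}\sigma^2$, $d_{s+1}(\Sigma) = \tilde\lambda_{s+1} + m^{-1}\sigma^2$ and also $\sup_k |\tilde\lambda_k - \lambda_k| \le C_3 m^{(1-\gamma)/(\gamma+\gamma_1)}$.
Moreover, by assumption $\lambda_s \ge (1 + \delta)\eta_2$ and $\lambda_{s+1} \le (1 - \delta)\eta_2$. In light of all the above we
obtain

$$\begin{aligned}
(1 - \epsilon_1)\eta_2 - d_{s+1}(\Sigma) &\ge (1 - \epsilon_1)\eta_2 - \lambda_{s+1} - C_3 m^{(1-\gamma)/(\gamma+\gamma_1)} - m^{-1}\sigma^2 \\
&\ge \eta_2 \{(1 - \epsilon_1) - (1 - \delta) - \eta_2^{-1}(C_3 m^{(1-\gamma)/(\gamma+\gamma_1)} + m^{-1}\sigma^2)\} \\
&\ge \eta_2/2,
\end{aligned}$$

and similarly $d_s(\Sigma) - (1 + \epsilon_1)\eta_2 \ge \eta_2/2$. Therefore inequality (A.21) becomes

$$\mathbb{P}\{\hat k(2\eta_{2,n}) \ne r_e(2\eta_2)\} \le \mathbb{P}(\|\Sigma_n - \Sigma\|_2 \ge \eta_2) + \mathbb{P}(\mathcal{A}^c) \le 10 n^{-1},$$

which proves the theorem. □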
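The mechanism behind the first inequality of this proof can be illustrated generically. By Weyl's inequality each ordered eigenvalue of the sample matrix lies within the operator-norm error of the corresponding population eigenvalue, so if that error is smaller than the distance from the threshold to the nearest population eigenvalue, thresholding selects the same number of eigenvalues. The sketch below checks this on made-up spectra. The function names and all numbers are hypothetical, and the code does not reproduce the specific thresholds $2\eta_{2,n}$ and $2\eta_2$ of the theorem.

```python
def count_above(eigenvalues, threshold):
    """Number of eigenvalues that are at least `threshold`."""
    return sum(1 for d in eigenvalues if d >= threshold)


def margin(eigenvalues, threshold):
    """Distance from `threshold` to the nearest eigenvalue."""
    return min(abs(d - threshold) for d in eigenvalues)


# Hypothetical population spectrum and threshold (illustrative values only),
# both spectra listed in descending order so that entries pair up by rank.
population = [5.0, 3.0, 1.2, 0.4, 0.1]
perturbed = [5.3, 2.8, 1.0, 0.6, 0.05]
eta = 0.8

max_shift = max(abs(a - b) for a, b in zip(population, perturbed))
print("largest eigenvalue shift:", max_shift)
print("margin around the threshold:", margin(population, eta))
print("selected (population, perturbed):",
      count_above(population, eta), count_above(perturbed, eta))
```

Whenever `max_shift` is below the margin, the two counts must agree, which is the event whose complement is bounded in probability above.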